Recent advances in deep learning-based 3D representation have achieved remarkable success, particularly in modeling static high-fidelity geometries. Extending these techniques to dynamic 3D scenes, however, introduces the critical challenge of representing spatio-temporal dependencies, i.e., jointly modeling detailed spatial structures within frames and temporal dynamics across frames. To address this challenge, this paper proposes that the temporal evolution observed in dynamic 3D scenes is fundamentally attributable to the deformation of underlying spatial structures. To capture this relationship, we introduce SEP-4D, a unified continuous 4D latent representation that incorporates a structure-equivalence prior. At the core of SEP-4D is an efficient 4D tensor decomposition-fusion approach: decomposed learnable 2D feature planes are fused through a plane-wise spatio-temporal fusion mechanism over planar distributions, explicitly enforcing the principle that temporal evolution originates from geometric deformation of the 3D structure. To mitigate the associated computational cost, we sample the 3D probability volumes produced by VAE-based fusion into a spatio-temporally consistent 4D latent representation. We validate our approach on the fundamental task of 4D occupancy reconstruction. Extensive results demonstrate that, by leveraging the inherent equivalence of temporal dynamics and structural deformation, our method achieves high-quality reconstruction across various sequence lengths; notably, on 4-frame scenes it attains 91.68% mIoU, significantly outperforming state-of-the-art baselines on standard benchmarks. The code will be made publicly available.
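The abstract does not specify the exact form of the tensor decomposition, but the "decomposed learnable 2D feature planes" fused per plane for a 4D query are reminiscent of HexPlane/K-Planes-style factorizations. The sketch below is a generic illustration of that idea, not the authors' implementation: a 4D point (x, y, z, t) is projected onto six axis-aligned 2D planes, each plane is bilinearly sampled, and the per-plane features are fused by an elementwise product. All names, plane choices, and the product-fusion rule are illustrative assumptions.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly interpolate a (H, W, C) feature plane at coords (u, v) in [0, 1]."""
    H, W, _ = plane.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0]
            + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0]
            + wx * wy * plane[y1, x1])

def query_4d(planes, x, y, z, t):
    """Fuse six 2D plane features for one 4D query point.

    Three spatial planes (xy, xz, yz) capture within-frame structure;
    three space-time planes (xt, yt, zt) capture its temporal deformation.
    Fusion here is an elementwise product (an illustrative choice).
    """
    coords = {"xy": (x, y), "xz": (x, z), "yz": (y, z),
              "xt": (x, t), "yt": (y, t), "zt": (z, t)}
    feat = np.ones(planes["xy"].shape[-1])
    for name, (u, v) in coords.items():
        feat *= bilinear_sample(planes[name], u, v)
    return feat

# Toy example: six 16x16 learnable planes with 8 feature channels.
rng = np.random.default_rng(0)
planes = {k: rng.standard_normal((16, 16, 8))
          for k in ("xy", "xz", "yz", "xt", "yt", "zt")}
feat = query_4d(planes, 0.3, 0.7, 0.5, 0.1)
print(feat.shape)  # (8,) — one fused feature vector per 4D query
```

In a full model the fused feature would be decoded (e.g., by an MLP or, as the abstract suggests here, a VAE-style fusion producing probability volumes) into an occupancy prediction; this sketch only shows the decomposition-and-fusion step.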