Singapore

Motion estimation in degraded scenes has long been a significant challenge, primarily attributed to substantial scene variations and insufficient training data. Existing approaches typically address this limitation by incorporating additional training strategies or modifying network architectures within conventional frameworks. However, these solutions not only require cumbersome training procedures or additional modal inputs, but also lack generalization capabilities. To address this problem, we propose a unified optical flow estimation framework specifically designed for degraded scenes. In this work, we employ large-scale pre-trained optical flow foundation models as both teacher and student networks. Our objective is to compensate for feature incompleteness during image degradation through pre-trained large models. Subsequently, we leverage supervised signals for fine-tuning and introduce an intra-inter frame distillation method to enable the student network to adapt to diverse cross-domain scenarios. Our proposed methodology provides deeper insights into learning style-invariant features from these learnable fine-tuning layers. Extensive experiments demonstrate that our approach achieves superior generalization performance and state-of-the-art results in degraded scenes (including low-light, rain, fog and other conditions) while requiring minimal training resources.

AAAI 2026

FlowAnyTime: Efficient Fine-tuning with Intra-Inter Frame Distillation for All-Weather Optical Flow Estimation

motion & tracking; scene analysis & understanding; video understanding & activity analysis

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance.

UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Collaborative perception leveraging intermediate feature fusion has emerged as a leading paradigm to significantly enhance the environmental perception capabilities of autonomous driving systems. However, existing methods typically rely on discriminative supervision guided by downstream tasks. This paradigm compels models to learn minimal, task-specific representations, which conflicts with the goal of cooperative perception to capture comprehensive information, thereby limiting generalization. To address this issue, we propose DiGS-CP, a novel two-stage generative supervised collaborative perception framework. Specifically, we introduce a diffusion-based generative task that conditions on fused object-level features to generate representations of object-level point clouds. The proposed generative supervision provides fine-grained, task-agnostic signals that encourages the fusion module to learn comprehensive representations beyond task-specific requirements. By preserving and integrating complementary information from collaborative agents, our approach overcomes the limitations of task-specific learning and enhances the generalizability of the learned features. Furthermore, our two-stage architecture requires agents to transmit only object-level features, significantly reducing communication overhead. Extensive experiments on three benchmark datasets demonstrate that DiGS-CP achieves state-of-the-art performance in 3D object detection, while maintaining low bandwidth requirements and exhibiting excellent generalization ability. The model and code will be made publicly available.

From Discriminative to Generative: A Diffusion-Based Paradigm for Multi-Agent Collaborative Perception

We study the expressive power of graph neural networks (GNNs) with
mean as the aggregation function. In the non-uniform setting, we
show that such GNNs have exactly the same expressive power as ratio modal
logic, which has modal operators expressing that at least a
certain
ratio of the successors of a vertex satisfies a specified property.
The non-uniform expressive power of mean GNNs is thus higher than that of GNNs
with max aggregation, but lower than for sum
aggregation--the latter are characterized by modal logic and graded
modal logic, respectively. In the uniform setting, we show that the expressive power relative to MSO is exactly that of alternation-free modal logic, under the natural assumptions that combination functions are continuous and classification functions are thresholds.
This implies that, relative to MSO and in the uniform setting, mean
GNNs are strictly less expressive than sum GNNs and max
GNNs. When any of the assumptions is dropped, the expressive power increases.

Logical Characterizations of GNNs with Mean Aggregation

Non-Markovian Tasks (NMTs) are distinguished by their dependence on long-term memory and state-dependent dynamics, setting them apart from the traditional Markovian models typically employed in Reinforcement Learning (RL). NMTs not only suffer from reward sparseness but also rely on historical information, making their resolution considerably more challenging. In this paper, we propose a novel RL framework T4NMTD (Transition-centric framework for NMT Decomposition), designed specifically for learning NMTs which are specified by temporal logic. The core of T4NMTD is a task decomposition mechanism along with a parallel training approach for NMTs. An NMT is first decomposed as basic units based on the transitions of the automata which are derived from temporal logic formulae. The units are then modularized into sub-tasks according to their semantic similarity under logical interpretation. The training strategy of T4NMTD adopts a dual-level structure: the high-level learns to shape the boundaries and coordinate arrangement of the sub-tasks from a global perspective, while the low-level learns those sub-tasks in parallel. In addition, we invent a dynamic policy intervention scheme to mitigate the policy myopic issue during parallel training. A comprehensive evaluation is conducted on benchmark problems with respect to various metrics. The experimental results demonstrate that T4NMTD effectively addresses NMTs, achieving significant performance improvements compared with related studies.

T4NMTD: Transition-Centric Reinforcement Learning for Non-Markovian Task Decomposition

Amodal segmentation is an image-based algorithm that aims to predict masks for both visible and occluded parts of objects. Existing methods typically rely on supervised learning with annotated amodal masks or synthetic data. The effectiveness of these methods relies heavily on the quality of the datasets. This dependence can unintentionally restrict their generalization capabilities due to insufficient diversity and size. Although existing zero-shot methods perform well on their reported datasets, their performance does not necessarily transfer to other datasets. We propose a $\textbf{tuning-free}$ approach that re-purposes diffusion-based inpainting foundation models for amodal segmentation. Our approach is motivated by the “occlusion-free bias” of inpainting models, i.e., the inpainted objects tend to be complete and without occlusions. We reconstruct the occluded regions of an object via inpainting and then apply segmentation, all $\textbf{without additional training or fine-tuning}$. Experiments on five datasets, three previously unreported, demonstrate the generalizability of our approach. On average, our approach achieves 5.3% more accurate masks in mIoU compared to the publicly available state-of-the-art, pix2gestalt.

Tuning-Free Amodal Segmentation via the Occlusion-Free Bias of Inpainting Models

Model watermarking techniques can embed watermark information into the protected model for ownership declaration by constructing specific input-output pairs. However, existing watermarks are easily removed when facing model stealing attacks, and make it difficult for model owners to effectively verify the copyright of stolen models. In this paper, we analyze the root cause of the failure of current watermarking methods under model stealing scenarios and then explore potential solutions. Specifically, we introduce a robust watermarking framework, DeepTracer, which leverages a novel watermark samples construction method and a same-class coupling loss constraint. DeepTracer can incur a high-coupling model between watermark task and primary task that makes adversaries inevitably learn the hidden watermark task when stealing the primary task functionality. Furthermore, we propose an effective watermark samples filtering mechanism that elaborately select watermark key samples used in model ownership verification to enhance the reliability of watermarks. Extensive experiments across multiple datasets and models demonstrate that our method surpasses existing approaches in defending against various model stealing attacks, as well as watermark attacks, and achieves new state-of-the-art effectiveness and robustness.

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks

Recent advances in optimizing Gaussian Splatting for scene geometry have enabled efficient reconstruction of detailed surfaces from images. However, when input views are sparse, such optimization is prone to overfitting, leading to suboptimal reconstruction quality. Existing approaches address this challenge by employing flattened Gaussian primitives to better fit surface geometry, combined with depth regularization to alleviate geometric ambiguities under limited viewpoints. Nevertheless, the increased anisotropy inherent in flattened Gaussians exacerbates overfitting in sparse-view scenarios, hindering accurate surface fitting and degrading novel view synthesis performance. In this paper, we propose SparseSurf, a method that reconstructs more accurate and detailed surfaces while preserving high-quality novel view rendering. Our key insight is to introduce Stereo Geometry-Texture Alignment, which bridges rendering quality and geometry estimation, thereby jointly enhancing both surface reconstruction and view synthesis. In addition, we present a Pseudo-Feature Enhanced Geometry Consistency that enforces multi-view geometric consistency by incorporating both training and unseen views, effectively mitigating overfitting caused by sparse supervision. Extensive experiments on the DTU, BlendedMVS, and Mip-NeRF360 datasets demonstrate that our method achieves the state-of-the-art performance.

SparseSurf: Sparse-View 3D Gaussian Splatting for Surface Reconstruction

Automated auscultation advances the detection of respiratory diseases, especially in areas with limited resources where traditional diagnostic methods are unavailable. On the other hand, the scarcity of auscultation datasets limits the automation performance, prompting the needs for data augmentation methods. However, most of the existing methods neglect the difference in acoustic sounds that requires personalized augmentation strategies. To address this, we propose a Progressive-Adaptive Spectral Augmentation (PASA), which is one of the first paradigms to adaptively select the best augmentation strategy for each sample. The PASA innovatively treats augmentation selection problem as a Markov Decision Process (MDP), creating an alternating loop between the diagnostic model and the augmentation selection. The agent selects the optimal augmentation operations and magnitudes via a task-specific design, including state construction, action sampling, Hybrid Batch-Sample (HBS) strategy execution, and reward guidance. The HBS strategy initially applies uniform augmentation across mini-batches while collecting sample-specific performance statistics. When model performance stabilizes, it transits to sample-level augmentation based on accumulated difficulty assessments. This two-phase design balances computational complexity with personalization. Extensive experiments across three benchmark datasets demonstrate that the PASA outperforms the state-of-the-art methods, pioneering a transformative paradigm for adaptive data augmentation in automated auscultation.

PASA: Progressive-Adaptive Spectral Augmentation for Automated Auscultation in Data-Scarce Environments

Scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data, while it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this paper, we disentangle human-scene interaction from motion synthesis during training, and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. For long-term motion synthesis, we introduce motion blending in joint rotation power space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes.

Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

Label Distribution Learning (LDL) is a groundbreaking paradigm for addressing the task with label ambiguity. Subjectivity in annotating label description degrees often leads to imbalanced label distribution. Existing approaches either adopt representation alignment or decoupling strategies to solve the imbalanced label distribution learning (ILDL). However, representation alignment-based methods overlook the issue of gradient vanishing for non-dominant branches within imbalanced label distributions, while decoupling-based approaches fail to achieve adaptive weight optimization. To address these issues, we propose Adaptive Momentum and Exponential Moving Average weighted modeling (AMEMA). AMEMA combines EMA-based loss weighting with momentum allocation to mitigate gradient attenuation in non-dominant label learning and adaptively balance the optimization signals between dominant and non-dominant branches. Specifically, AMEMA decouples label distributions into two branches, dominant and non-dominant. It computes and updates Kullback-Leibler divergence losses for each branch using EMA. Different initial momenta are applied to each branch to facilitate branch-specific optimization dynamics. Dynamic weighting coefficients, derived from EMA-smoothed losses, allow the model to adjust its learning direction adaptively and improve the learning of non-dominant labels. Extensive experiments on benchmark datasets show that AMEMA consistently outperforms state-of-the-art ILDL methods across various evaluation metrics.

Content not yet available

Next from AAAI 2026

UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES