Singapore

Video object detection is a fundamental yet challenging task in computer vision. Recently, DETR-based methods have gained prominence in this domain owing to their powerful global modeling capabilities. However, these methods are still confronted with two key limitations: frame-agnostic initialization of object queries and scale-agnostic attention mechanisms, which hinder their capability to capture the appearance variations of dynamic objects and model the temporal consistency across frames. To alleviate these limitations, we propose a multiscale-aware transformer diffusion network (MSTDiff), a novel framework designed for the video object detection task, including two technical improvements over existing methods. First, we design a diffusion-driven adaptive query module, which models the object query distribution through a diffusion process conditioned on input frames, enabling an adaptive and content-aware initialization of object queries. Second, we develop a multiscale-aware transformer encoder module, which combines multi-head convolutional units with attention mechanisms to enhance multi-scale feature representations while preserving global dependence modeling. We conduct extensive experiments on the public ImageNet VID dataset, and the results demonstrate that our MSTDiff achieves 87.7% mAP with ResNet-101, outperforming previous state-of-the-art video object detection methods. The code will be made available.

AAAI 2026

MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection

spatiotemporal feature aggregation

multiscale-aware transformer encoder

diffusion model

video object detection

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


User feedback is critical for refining recommendation systems, yet explicit feedback (e.g., likes or dislikes) remains scarce in practice. As a more feasible alternative, inferring user preferences from massive implicit feedback has shown great potential (e.g., a user quickly skipping a recommended video usually indicates disinterest). Unfortunately, implicit feedback is often noisy: a user might skip a video due to accidental clicks or other reasons, rather than disliking it. Such noise can easily cause user interests to be misjudged, thereby undermining recommendation performance.
To address this issue, we propose a novel Group-aware User Behavior Simulation (G-UBS) paradigm, which leverages contextual guidance from relevant user groups, enabling robust and in-depth interpretation of implicit feedback for individual users.
Specifically, G-UBS operates via two key agents. First, the User Group Manager (UGM) effectively clusters users to generate group profiles utilizing a "summarize-cluster-reflect" workflow based on LLMs. Second, the User Feedback Modeler (UFM) employs an innovative group-aware reinforcement learning approach, where each user is guided by the associated group profiles during the reinforcement learning process, allowing UFM to robustly and deeply examine the reasons behind implicit feedback.
To assess our G-UBS paradigm, we have constructed a Video Recommendation benchmark with Implicit Feedback (IF-VR).
To the best of our knowledge, this is the first multi-modal benchmark for implicit feedback evaluation in video recommendation,
encompassing 15k users, 25k videos, and 933k interaction records with implicit feedback.
Extensive experiments on IF-VR demonstrate that G-UBS significantly outperforms mainstream LLMs and MLLMs, with a 4.0% higher proportion of videos achieving a play rate > 30% and 14.9% higher reasoning accuracy on IF-VR.

G-UBS: Towards Robust Understanding of Implicit Feedback via Group-Aware User Behavior Simulation
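The "play rate > 30%" figure quoted in the G-UBS abstract can be illustrated with a small calculation. This is a minimal sketch only: the abstract does not define play rate, so the definition used here (watched time divided by video duration, capped at 1.0) and the names `play_rate` and `share_above` are assumptions made for illustration, not the authors' metric code.

```python
def play_rate(watched_seconds: float, duration_seconds: float) -> float:
    """Assumed definition: fraction of the video that was watched, capped at 1.0."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return min(watched_seconds / duration_seconds, 1.0)


def share_above(records, threshold: float = 0.30) -> float:
    """Proportion of (watched, duration) records whose play rate exceeds the threshold."""
    if not records:
        return 0.0
    hits = sum(1 for watched, duration in records if play_rate(watched, duration) > threshold)
    return hits / len(records)


# Hypothetical interaction records: (watched seconds, video duration in seconds).
records = [(10, 100), (40, 100), (90, 100), (30, 100)]
share = share_above(records)
```

Under this reading, comparing two recommenders amounts to computing `share_above` on the videos each one recommends and comparing the two proportions.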

Recent studies have raised significant concerns regarding the reliability of current mathematics benchmarks, highlighting issues such as simplistic design and potential data contamination. Consequently, developing a reliable benchmark that effectively evaluates large language models' (LLMs) genuine capabilities in mathematical reasoning remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we build question-generating functions to produce random variable questions (RVQs), whose background content mirrors original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings suggest that LLMs exhibit a proficiency imbalance between encountered and "unseen" data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, but we verified that it can still be effectively elicited through test-time scaling.

Benchmarking LLMs’ Mathematical Reasoning with Unseen Random Variables Questions
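The idea of a question-generating function described in the RV-Bench abstract can be sketched as follows. This is an illustrative toy rather than the benchmark's actual code: the question template, the variable ranges, and the names `make_rvq`, `accuracy_on_rvqs`, `parsing_solver`, and `memorizing_solver` are all hypothetical.

```python
import random
import re


def make_rvq(rng: random.Random):
    """Hypothetical question-generating function: fixed wording, randomized variables."""
    speed = rng.randint(30, 90)  # km/h
    hours = rng.randint(2, 9)
    question = (
        f"A train travels at {speed} km/h for {hours} hours. "
        "How many kilometres does it cover?"
    )
    return question, speed * hours


def accuracy_on_rvqs(solver, n_samples: int = 100, seed: int = 0) -> float:
    """Fraction of freshly sampled questions that the solver answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        question, answer = make_rvq(rng)
        if solver(question) == answer:
            correct += 1
    return correct / n_samples


def parsing_solver(question: str) -> int:
    """Stand-in for a model that has understood the pattern: multiply the two numbers."""
    speed, hours = (int(x) for x in re.findall(r"\d+", question))
    return speed * hours


def memorizing_solver(question: str) -> int:
    """Stand-in for a model that recalls one fixed instance (60 km/h for 3 hours)."""
    return 180
```

A solver that has captured the underlying pattern scores perfectly on every sampled variable combination, while one that only recalls a single seen instance is correct only when the sampled variables happen to reproduce that answer.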

Humans increasingly query Large Language Models (LLMs) to accomplish personal tasks according to their individual preferences. However, these preferences are often unconsciously veiled during conversation. To address this, LLMs must elicit human preferences through multi-turn dialogue, where tasks are accomplished via iterative clarifying questions and a final response generated by LLMs acting as effective questioners. Existing approaches based on self-taught reasoning have two limitations: 1) they struggle to avoid generating irrelevant questions and 2) the final responses to tasks are misled by the conversations. To overcome these limitations, we propose TO-GATE, a novel framework that enhances question generation through trajectory optimization. TO-GATE comprises two key components: a clarification resolver, which generates optimal questioning trajectories to produce effective elicitation questions, and a summarizer, which ensures task-aligned final responses. Experimental results show that TO-GATE significantly outperforms baseline methods, achieving a 9.32% improvement on standard preference elicitation benchmarks.

TO-GATE: Clarifying Questions and Summarizing Responses with Trajectory Optimization for Eliciting Human Preference

Accurate 3D scene motion perception significantly enhances the safety and reliability of an autonomous driving system.
Benefiting from its all-weather operational capability and unique perceptual properties, 4D mmWave radar has emerged as an essential component in advanced autonomous driving.
However, sparse and noisy radar points often lead to imprecise motion perception, leaving autonomous vehicles with limited sensing capabilities when optical sensors degrade under adverse weather conditions.
In this paper, we propose RadarMP, a novel method for precise 3D scene motion perception using low-level radar echo signals from two consecutive frames. 
Unlike existing methods that separate radar target detection and motion estimation, RadarMP jointly models both tasks in a unified architecture, enabling consistent radar point cloud generation and pointwise 3D scene flow prediction.
Tailored to radar characteristics, we design specialized self-supervised loss functions guided by Doppler shifts and echo intensity, effectively supervising spatial and motion consistency without explicit annotations.
Extensive experiments on the public dataset demonstrate that RadarMP achieves reliable motion perception across diverse weather and illumination conditions, outperforming radar-based decoupled motion perception pipelines and enhancing perception capabilities for full-scenario autonomous driving systems.

RadarMP: Motion Perception for 4D mmWave Radar in Autonomous Driving

Reconstructing precise CAD modeling sequences from point clouds remains a challenging task, especially for objects with complex geometry and topology. In this paper, by formulating the CAD sequence reconstruction as a Markov decision process, we introduce ReACT, a novel Reward-informed Autoregressive decision Cad Transformer architecture for robust CAD sequence prediction. Beyond previous imitation-only approaches, our key innovation is to frame the CAD Transformer under a reinforcement learning paradigm and thereby integrate reward-inspired heuristic learning into our architecture. This allows ReACT to effectively leverage shape-aware long-term reward feedback to guide the inference of (nearly) optimal CAD commands. Specifically, conditioned on past tokens, comprising the historical CAD states, sketch-extrude commands (i.e., actions) and associated geometric rewards, ReACT autoregressively outputs the most promising CAD commands in a causal manner. In particular, we develop a novel scaffold-aware CAD state representation that integrates global point-command features with an incrementally constructed surface point scaffold, enabling fine-grained geometric reasoning for subsequent reconstruction prediction. Moreover, an effective local barrel points-guided dense reward function is designed to jointly evaluate surface fidelity and command efficiency for reliable reward guidance. Extensive evaluations on the DeepCAD and Fusion360 benchmarks demonstrate that ReACT can achieve superior CAD reconstruction quality, even for objects with complex shapes.

ReACT: Reward-informed Autoregressive Decision CAD Transformer

Person re-identification (Re-ID) under extremely low-light conditions suffers from severe image degradation, which significantly impairs the extraction of identity-discriminative features. Existing methods struggle to recover semantic information that is obscured under poor illumination. To better understand this problem, we conduct a comprehensive analysis of the semantic modeling behavior of Re-ID models in low-light settings. For the first time, we investigate the norm distributions of Query (Q), Key (K), and Value (V) vectors within the attention module and observe that, as illumination decreases, the norm of Query vectors in pedestrian regions drops significantly. This leads to dispersed attention and degraded feature representations. To address this issue, we propose a novel framework named Norm-Ratio Attention and Semantic Recovery Distillation Network (NRSRD), which consists of two key components: a Norm-Ratio Attention Module (NRA) and a Semantic Recovery Distillation Module (SRD). The former dynamically adjusts attention responses based on the ratio of K/Q vector norms, enhancing structural region perception while suppressing background interference. The latter transfers discriminative semantic knowledge from high-illumination auxiliary data to the low-light model, compensating for the semantic degradation caused by poor lighting. Extensive experiments on multiple publicly available low-light Re-ID benchmarks demonstrate the effectiveness and superiority of the proposed method.

Revisiting Attention in the Dark for Low-Light Person Re-Identification

Knowledge distillation (KD) transfers the "dark knowledge" from a complex teacher model to a compact student model. However, heterogeneous architecture distillation, such as Vision Transformer (ViT) to ResNet18, faces challenges due to differences in spatial feature representations. Traditional KD methods are mostly designed for homogeneous architectures and hence struggle to effectively address the disparity. Although heterogeneous KD approaches have been developed recently to solve these issues, they often incur high computational costs and complex designs, or overly rely on logit alignment, which limits their ability to leverage the complementary features. To overcome these limitations, we propose Heterogeneous Complementary Distillation (HCD), a simple yet effective framework that integrates complementary teacher and student features to align representations in shared logits. These logits are decomposed and constrained to facilitate diverse knowledge transfer to the student. Specifically, HCD processes the student's intermediate features through a convolutional projector and adaptive pooling, concatenates them with the teacher's features from the penultimate layer, and then maps them via the Complementary Feature Mapper (CFM) module, comprising a fully connected layer, to produce shared logits. We further introduce Sub-logit Decoupled Distillation (SDD), which partitions the shared logits into $n$ sub-logits that are fused with the teacher's logits to rectify classification. To ensure sub-logit diversity and reduce redundant knowledge transfer, we propose an Orthogonality Loss (OL). By preserving student-specific strengths and leveraging teacher knowledge, HCD enhances robustness and generalization in students. Extensive experiments on the CIFAR-100, fine-grained (e.g., CUB200, Aircraft) and ImageNet-1K datasets demonstrate that HCD outperforms state-of-the-art KD methods, establishing it as an effective solution for heterogeneous KD. The code will be publicly available.

Heterogeneous Complementary Distillation
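The HCD abstract mentions partitioning shared logits into $n$ sub-logits and encouraging their diversity with an orthogonality loss, but it does not give the formula. The sketch below shows one common way such a penalty can be written, the mean squared cosine similarity between distinct sub-logit vectors. Both the equal-size split and this particular penalty are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def split_sub_logits(shared_logits, n: int):
    """Split a flat list of shared logits into n equal-length sub-logit vectors (assumed layout)."""
    if n <= 0 or len(shared_logits) % n != 0:
        raise ValueError("number of logits must be a positive multiple of n")
    size = len(shared_logits) // n
    return [shared_logits[i * size:(i + 1) * size] for i in range(n)]


def cosine(u, v) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def orthogonality_penalty(sub_logits) -> float:
    """Assumed penalty: mean squared cosine similarity over all distinct pairs of sub-logits."""
    pairs = [
        (i, j)
        for i in range(len(sub_logits))
        for j in range(i + 1, len(sub_logits))
    ]
    if not pairs:
        return 0.0
    return sum(cosine(sub_logits[i], sub_logits[j]) ** 2 for i, j in pairs) / len(pairs)


# Orthogonal sub-logits give a penalty near 0; identical ones give a penalty near 1.
diverse = orthogonality_penalty(split_sub_logits([1.0, 0.0, 0.0, 1.0], n=2))
redundant = orthogonality_penalty(split_sub_logits([1.0, 2.0, 1.0, 2.0], n=2))
```

Minimizing a term of this kind pushes the sub-logits toward mutually orthogonal directions, which is one plausible reading of "ensuring sub-logit diversity".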

The proliferation of generative image models has revolutionized AIGC creation while amplifying concerns over content provenance and manipulation forensics.
Existing methods are typically either unable to localize tampering or restricted to specific generative settings, limiting their practical utility.
We propose GenPTW, a General watermarking framework that unifies Provenance tracing and Tamper localization in latent space.
It supports both in-generation and post-generation embedding without altering the generative process, and is plug-and-play compatible with latent diffusion models (LDMs) and visual autoregressive (VAR) models.
To enable accurate tracing and tamper localization, we propose a dual-module design: a cross-attention fusion mechanism adaptively embeds watermark guided by latent features, while a spatial fusion module reinforces localization by injecting complete watermark information.
A tamper-aware extractor further unifies provenance and manipulation decoding, tightly coupling watermark semantics with forensic objectives.
Experiments show that GenPTW maintains high visual fidelity and strong robustness against diverse AIGC editing operations.

GenPTW: Latent Image Watermarking for Provenance Tracing and Tamper Localization

Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior. They overlook the fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term and large-scale paired speaker-listener corpora incorporating head dynamics and fine-grained multi-modality annotations limits the application of dialogue modeling. Therefore, we first collect a large-scale multi-turn dataset of 3D dyadic conversations, dubbed ListenerX, containing more than 1.4M valid frames for multi-modal responsive interaction. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive, and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reaction with speaker behavior. Meanwhile, we propose the Emotional Intensity Tags (EIT) for emotion intensity editing with multi-modal information integration, applying to both text descriptions and listener motion amplitude. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.

VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction

Masked autoencoders (MAE) have become a dominant paradigm in 3D representation learning, setting new performance benchmarks across various downstream tasks. Existing methods with fixed mask ratios neglect multi-level representational correlations and intrinsic geometric structures, while relying on point-wise reconstruction assumptions that conflict with the diversity of point clouds. To address these issues, we propose a 3D representation learning method, termed Point-SRA, which aligns representations through self-distillation and probabilistic modeling. Specifically, we assign different masking ratios to the MAE to capture complementary geometric and semantic information, while the MeanFlow Transformer (MFT) leverages cross-modal conditional embeddings to enable diverse probabilistic reconstruction. Our analysis further reveals that representations at different time steps in MFT also exhibit complementarity. Therefore, a Dual Self-Representation Alignment mechanism is proposed at both the MAE and MFT levels. Finally, we design a Flow-Conditioned Fine-Tuning Architecture to fully exploit the point cloud distribution learned via MeanFlow. Point-SRA outperforms Point-MAE by 5.37% on ScanObjectNN. On intracranial aneurysm segmentation, it reaches 96.07% mean IoU for arteries and 86.87% for aneurysms. For 3D object detection, Point-SRA achieves 47.3% AP@50, surpassing MaskPoint by 5.12%.
