Multi-view 3D object detection has garnered increasing attention, particularly due to its success in autonomous driving systems. Although multi-view systems capture rich semantic information, their spatial-geometric reasoning capabilities remain limited. Recent studies employ simulated point cloud generation to facilitate LiDAR-camera multi-modal knowledge distillation and achieve structural consistency. However, these methods still suffer from two major drawbacks: (i) alignment challenges arising from significant discrepancies between the LiDAR and camera modalities, and (ii) prediction errors in the simulated point cloud that may degrade the extracted image semantics during fusion. Accordingly, we propose adaptive-smooth distillation, which adjusts the granularity of alignment according to the feature discrepancy for LiDAR-camera knowledge distillation. Specifically, this work considers both LiDAR-to-camera cross-modal distillation and multi-modal distillation from LiDAR-camera fusion to simulated point cloud-camera fusion. We then introduce a heterogeneous fusion module that strategically biases the fusion process toward the extracted camera features, thereby enhancing the robustness of the fused feature. Additionally, we propose soft-weighted response distillation, which enables the student model to selectively mimic the high-quality outputs of the teacher model. Extensive experiments demonstrate the superiority of our method, which achieves statistically significant improvements of 4.9% mAP and 4.5% NDS.
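
The abstract does not spell out the adaptive-smooth formulation, but the idea of tuning alignment granularity to the feature discrepancy can be sketched as a smooth-L1-style distillation loss whose transition point adapts per sample. Everything below, including the function name, the per-sample discrepancy statistic, and the clamping range, is an illustrative assumption rather than the authors' exact loss.

```python
import torch

def adaptive_smooth_distill_loss(student_feat, teacher_feat, base_beta=1.0, eps=1e-6):
    """Hypothetical sketch of an adaptive-smooth distillation loss.

    The transition point `beta` of a smooth-L1-style loss is scaled per
    sample by the normalized feature discrepancy, so well-aligned samples
    are matched finely (quadratic regime) while samples with large
    LiDAR-camera discrepancy are penalized more gently (linear regime).
    Assumes BEV feature maps of shape (B, C, H, W).
    """
    diff = (student_feat - teacher_feat).abs()
    # Per-sample discrepancy statistic used to adapt the granularity.
    disc = diff.mean(dim=(1, 2, 3), keepdim=True)
    beta = base_beta * (disc / (disc.mean() + eps)).clamp(0.5, 2.0)
    # Smooth-L1 with a sample-wise adaptive transition point.
    quad = 0.5 * diff.pow(2) / beta
    lin = diff - 0.5 * beta
    loss = torch.where(diff < beta, quad, lin)
    return loss.mean()
```

Under this reading, a large discrepancy relaxes the alignment (a wider linear regime), which is one way to avoid forcing the student to match teacher features that the modality gap makes unreachable.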
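Similarly, the heterogeneous fusion module could be read as a gated fusion whose gate is biased toward the camera branch. The PyTorch sketch below assumes a sigmoid gate with a positive bias initialization so that fusion initially leans on camera features and is less exposed to simulated point cloud prediction errors; the module name, shapes, and bias value are hypothetical.

```python
import torch
import torch.nn as nn

class HeterogeneousFusion(nn.Module):
    """Sketch of a fusion module biased toward camera features.

    A learned gate mixes camera and simulated-point-cloud features;
    initializing the gate bias to a positive constant makes the sigmoid
    start near 1, so the fused output initially follows the camera
    branch. The bias value and architecture are assumptions.
    """
    def __init__(self, channels, camera_bias=2.0):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)
        nn.init.constant_(self.gate.bias, camera_bias)  # favor camera at init
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, camera_feat, sim_point_feat):
        g = torch.sigmoid(self.gate(torch.cat([camera_feat, sim_point_feat], dim=1)))
        fused = g * camera_feat + (1.0 - g) * sim_point_feat
        return self.out(fused)
```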
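Finally, soft-weighted response distillation suggests re-weighting each teacher prediction by some quality measure before the student imitates it. The sketch below uses teacher confidence (the max softmax probability) as that measure, which is an assumption; the paper may instead use a different quality signal, e.g., IoU with ground truth.

```python
import torch
import torch.nn.functional as F

def soft_weighted_response_distill(student_logits, teacher_logits, temperature=2.0):
    """Sketch of soft-weighted response distillation.

    Each teacher prediction's KD term is re-weighted by the teacher's
    own confidence, so the student selectively mimics high-quality
    teacher outputs and down-weights noisy ones. The confidence-based
    weight is an assumption, not the authors' published scheme.
    Assumes logits of shape (N, num_classes).
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    # Per-prediction KL divergence between teacher and student.
    kl = (t_prob * (t_prob.clamp_min(1e-8).log() - s_logp)).sum(dim=-1)
    # Soft weight: teacher confidence, normalized over predictions.
    w = t_prob.max(dim=-1).values
    w = w / (w.sum() + 1e-8)
    return (w * kl).sum() * temperature ** 2
```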
