Attributed Question Answering (AQA) aims to enhance the reliability of AI-generated answers by including references for each statement, helping users validate the provided information. However, existing work on AQA has primarily focused on text-only input and has largely overlooked the role of multimodality. We introduce MAVis, the first benchmark designed to evaluate end-to-end systems on understanding user intent behind visual questions, retrieving evidence from multimodal documents, and generating answers with citations. Our dataset comprises 157K visual QA instances, where each answer is annotated with sentence-level citations referring to multimodal documents. We develop automatic metrics along three dimensions -- informativeness, groundedness, and fluency -- and demonstrate their strong correlation with human judgments. Our key findings are threefold: (1) LVLMs in a multimodal RAG setting generate more informative and fluent answers than in unimodal RAG, but exhibit weak groundedness for image documents, a gap that is amplified in multimodal settings. (2) Given the same multimodal documents, there is a trade-off between informativeness and groundedness across different prompting methods. (3) Our proposed method highlights that mitigating contextual bias in interpreting image documents is a crucial direction for future research.
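The abstract mentions validating automatic metrics against human judgments. As a minimal illustrative sketch (not the paper's actual protocol, and with hypothetical scores), such a validation can be done by correlating per-answer metric scores with human ratings, e.g. via a Pearson correlation:

```python
# Illustrative sketch: checking how well an automatic metric (e.g. a
# groundedness score) tracks human ratings, using Pearson correlation.
# All scores below are hypothetical, not taken from the paper.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-answer scores: automatic metric vs. 1-5 human ratings.
auto_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_scores = [5, 2, 4, 1, 4]

r = pearson(auto_scores, human_scores)
print(round(r, 3))  # → 0.992
```

A high coefficient on held-out annotations is the usual evidence that an automatic metric can stand in for human evaluation at scale.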