Singapore

Visual Question Answering (VQA) requires models to reason over multimodal information, combining visual and textual data. With the development of continual learning, significant progress has been made in retaining knowledge and adapting to new information in the VQA domain. However, current methods often struggle with balancing knowledge retention, adaptation, and robust feature representation. To address these challenges, we propose a novel framework with adaptive memory allocation and global noise filtering called MacVQA for visual question answering. MacVQA fuses visual and question information while filtering noise to ensure robust representations, and employs prototype-based memory allocation to optimize feature quality and memory usage. These designs enable MacVQA to balance knowledge acquisition, retention, and compositional generalization in continual VQA learning. Experiments on ten continual VQA tasks show that MacVQA outperforms existing baselines, achieving 43.38% average accuracy and 2.32% average forgetting on standard tasks, and 42.53% average accuracy and 3.60% average forgetting on novel composition tasks.
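The abstract reports two continual-learning metrics, average accuracy and average forgetting. The sketch below shows one common way such metrics are computed from a matrix of per-task accuracies. It is an illustration under stated assumptions, not the authors' evaluation code: the function names, the `acc[i][j]` layout (accuracy on task `j` after training through task `i`), and the toy numbers are all hypothetical, and the paper's exact definitions may differ.

```python
from typing import List


def average_accuracy(acc: List[List[float]]) -> float:
    """Mean accuracy over all tasks after training on the final task.

    acc[i][j] is assumed to be the accuracy on task j measured after
    training on tasks 0..i (hypothetical layout, not from the paper).
    """
    final_row = acc[-1]
    return sum(final_row) / len(final_row)


def average_forgetting(acc: List[List[float]]) -> float:
    """Mean drop from each earlier task's best past accuracy to its final accuracy."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    drops = []
    for j in range(num_tasks - 1):  # the last task has no later stage to forget in
        best_before_end = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_before_end - acc[-1][j])
    return sum(drops) / len(drops)


# Toy accuracy matrix for three tasks (made-up numbers, in percent).
toy = [
    [50.0, 0.0, 0.0],
    [46.0, 40.0, 0.0],
    [44.0, 38.0, 42.0],
]
avg_acc = average_accuracy(toy)
avg_forget = average_forgetting(toy)
```

Under this definition, lower forgetting means earlier tasks lose less accuracy as new tasks are learned, which is the sense in which the reported 2.32% and 3.60% figures would be read.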

AAAI 2026

MacVQA: Adaptive Memory Allocation and Global Noise Filtering for Continual Visual Question Answering

nlp: question answering

cv: image and video retrieval

cv: scene analysis & understanding

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


The manifold hypothesis says that natural high-dimensional data lie on or around a low-dimensional manifold. The recent success of statistical and learning-based methods in very high dimensions empirically supports this hypothesis, suggesting that typical worst-case analysis does not provide practical guarantees. A natural step for analysis is thus to assume the manifold hypothesis and derive bounds that are independent of any ambient dimensions that the data may be embedded in. Theoretical implications in this direction have recently been explored in terms of generalization of ReLU networks and convergence of Langevin methods. In this work, we consider optimal uniform approximations with functions of finite statistical complexity. While upper bounds on uniform approximation exist in the literature using ReLU neural networks, we consider the opposite: lower bounds to quantify the fundamental difficulty of approximation on manifolds. In particular, we demonstrate that the statistical complexity required to approximate a class of bounded Sobolev functions on a compact manifold is bounded from below, and moreover that this bound is dependent only on the intrinsic properties of the manifold, such as curvature, volume, and injectivity radius.

Blessing of Dimensionality for Approximating Sobolev Classes on Manifolds

Recent advancements in instruction-based image editing methods have shown remarkable progress. However, current methods are often limited to simple instruction-based image editing, hindering real-world applications that usually encompass complex editing instructions. In this work, we address this from the perspective of architectural design, data, and evaluation protocols. Specifically, we identify the issue of insufficient instruction compliance and background inconsistency of previous models when performing this task. To this end, we propose MCIE-E1, a Multimodal Large Language Model-Driven Complex Instruction Image Editing framework that includes two key modules: a Spatial-Aware Cross Attention module and a Background-Consistent Cross Attention module. The former significantly improves instruction-following capability by explicitly aligning semantic instructions with spatial locations through the injection of spatial guidance across denoising timesteps. The latter enhances background features, thereby preserving consistency in unedited regions. To facilitate MCIE-E1 training, we propose a dedicated data construction pipeline to address the scarcity of datasets for complex instruction-based image editing. This pipeline integrates both fine-grained automatic filtering by a powerful MLLM and rigorous human filtering to ensure high-quality data. To evaluate MCIE-E1's capability of conducting complex instruction-based image editing, we introduce CIE-Bench, along with two new evaluation metrics. Experimental results on CIE-Bench demonstrate that MCIE-E1 surpasses the previous state-of-the-art method in both quantitative and qualitative evaluations, achieving a 23.96% improvement in instruction compliance.

MCIE: Multimodal LLM-Driven Complex Instruction Image Editing with Spatial Guidance

Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding. However, they exhibit inferior performance on tasks requiring fine-grained visual perception. We attribute this to the inherent limitations of ViTs in capturing diverse visual semantic levels. To address this, we present the Hierarchical window (Hiwin) transformer as a plug-and-play solution for MLLMs, centered around our inverse semantic pyramid (ISP). The Hiwin transformer comprises two key modules: (i) a visual detail injection module, which progressively injects low-level visual details into high-level language-aligned semantic features, thereby constructing an ISP, and (ii) a hierarchical window attention module, which leverages cross-scale windows to condense multi-level semantics from the ISP. Notably, our design achieves an average boost of 3.7% across 14 benchmarks compared with the baseline method, for instance 9.3% on DocVQA.

LLaVA-UHD v2: Exploiting Hierarchical Vision Granularity in MLLMs via Inverse Semantic Pyramid

Virtual screening (VS) is an essential task in drug discovery, focusing on the identification of small-molecule ligands that bind to specific protein pockets. Existing deep learning methods, from early regression models to recent contrastive learning approaches, primarily rely on structural data while overlooking protein sequences, which are more accessible and can enhance generalizability. However, directly integrating protein sequences poses challenges due to the redundancy and noise in large-scale protein-ligand datasets. To address these limitations, we propose S²Drug, a two-stage framework that explicitly incorporates protein Sequence information and 3D Structure context in protein-ligand contrastive representation learning. In the first stage, we perform protein sequence pretraining on ChEMBL using an ESM2-based backbone, combined with a tailored data sampling strategy to reduce redundancy and noise on both protein and ligand sides. In the second stage, we fine-tune on PDBbind by fusing sequence and structure information through a residue-level gating module, while introducing an auxiliary binding site prediction task. This auxiliary task guides the model to accurately localize binding residues within the protein sequence and capture their 3D spatial arrangement, thereby refining protein-ligand matching. Across multiple benchmarks, S²Drug consistently improves virtual screening performance and achieves strong results on binding site prediction, demonstrating the value of bridging sequence and structure in contrastive learning.

S²Drug: Bridging Protein Sequence and 3D Structure in Contrastive Representation Learning for Virtual Screening
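The S²Drug abstract mentions fusing sequence and structure information through a residue-level gating module. A generic gated fusion can be sketched as follows. This is only an illustration of the general idea, not the S²Drug module: the function names, the shared gate weights `gate_w`, the bias `gate_b`, and the toy feature vectors are all assumptions introduced here.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_fuse(
    seq_feats: List[Vector],
    struct_feats: List[Vector],
    gate_w: Vector,
    gate_b: float = 0.0,
) -> List[Vector]:
    """Fuse per-residue sequence and structure features with a scalar gate.

    For each residue, a gate g in (0, 1) is computed from the concatenated
    features, and the output is g * sequence + (1 - g) * structure.
    gate_w and gate_b are hypothetical parameters shared across residues.
    """
    fused = []
    for s, t in zip(seq_feats, struct_feats):
        concat = s + t  # list concatenation of the two feature vectors
        g = sigmoid(sum(w * v for w, v in zip(gate_w, concat)) + gate_b)
        fused.append([g * si + (1.0 - g) * ti for si, ti in zip(s, t)])
    return fused


# One residue with 2-dimensional features; zero gate weights give g = 0.5.
fused_example = gated_fuse([[1.0, 3.0]], [[3.0, 5.0]], gate_w=[0.0, 0.0, 0.0, 0.0])
```

With a learned gate, residues where one source is more informative can lean on it more heavily, which is one plausible reading of why gating is applied per residue rather than globally.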

Dynamic driving scene reconstruction is of great importance in fields such as digital twin systems and autonomous driving simulation. However, unacceptable degradation occurs when the view deviates from the input trajectory, leading to corrupted background and vehicle models. Existing methods for improving reconstruction quality on novel trajectories suffer from various limitations, including inconsistency, deformation, and high time consumption. This paper proposes LidarPainter, a one-step diffusion model that recovers consistent driving views from sparse LiDAR conditions and artifact-corrupted renderings in real time, enabling high-fidelity lane shifts in driving scene reconstruction. Extensive experiments show that LidarPainter outperforms state-of-the-art methods in speed, quality, and resource efficiency, running 7× faster than StreetCrafter while requiring only one fifth of the GPU memory. LidarPainter also supports stylized generation using text prompts such as “foggy” and “night”, allowing for a diverse expansion of the existing asset library.

LidarPainter: One-Step Away from Any Lidar View to Novel Guidance

Goal-driven persuasive dialogue, exemplified by applications like telemarketing, requires sophisticated multi-turn planning and strict factual faithfulness, which remains a significant challenge for even state-of-the-art Large Language Models (LLMs). Previous works are often limited by a lack of task-specific data, and direct LLM application suffers from strategic brittleness and factual hallucination. In this paper, we first construct and release TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain. We then propose AI-Salesman, a novel framework featuring a dual-stage architecture. For the training stage, we design a Bayesian-supervised reinforcement learning algorithm that learns robust sales strategies from noisy dialogues. For the inference stage, we introduce the Dynamic Outline-Guided Agent (DOGA), which leverages a pre-built script library to provide dynamic, turn-by-turn strategic guidance. Moreover, we design a comprehensive evaluation framework that combines fine-grained metrics for key sales skills with the LLM-as-a-Judge paradigm. Experimental results demonstrate that our proposed AI-Salesman significantly outperforms baseline models in both automatic metrics and comprehensive human evaluations, showcasing its effectiveness in complex persuasive scenarios. The complete source code and dataset for this paper will be made available upon publication.

AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

Optimizing quantum programs is key to mitigating noise, reducing error-correction overhead, and improving performance on both near-term and fault-tolerant devices. Existing heuristic and learning-based optimizers, however, lack formal guarantees and risk semantic errors in the presence of entanglement and measurement. We present RelOpt, a semantics-preserving optimizer that enforces *relational correctness* between original and optimized programs. RelOpt is built on a lightweight intermediate language (`QCore`) with a relational operational semantics supporting partial-trace equivalence, measurement-distribution preservation, and approximate correctness. Optimization is guided by a multi-objective cost model that considers gate count, circuit depth, and error-correction cost. Only rewrite rules that are formally verified against user-specified contracts are applied. The engine combines symbolic simulation, SMT reasoning, and cost analysis to achieve safe and effective optimizations. On standard benchmarks such as QFT, Grover, and QAOA, RelOpt consistently outperforms Qiskit, t|ket⟩, and learning-based optimizers across multiple cost metrics while maintaining formal guarantees. By integrating formal verification with cost-aware compilation, RelOpt establishes a foundation for trustworthy and hardware-adaptive quantum toolchains.

Relational Verification for Cost-Aware Quantum Program Optimization

Most existing multi-modal trackers adopt uniform fusion strategies and propagate temporal information through mixed tokens, which fails to account for modality-specific differences and results in entangled temporal representations. To address these limitations, we propose a **Modality-aware fusion and Decoupled temporal propagation multi-modal object Tracking (MDTrack)**. 
Specifically, for modality-aware fusion, we allocate dedicated experts to each modality (Infrared, Event, Depth, and RGB) to process their respective representations. The gating mechanism within the mixture of experts (MoE) then dynamically selects the optimal experts based on the input features, enabling adaptive and modality-specific fusion.
For decoupled temporal propagation, we introduce two separate State Space Model (SSM) structures to independently store and update the hidden states $h$ of the RGB and X-modal streams, effectively capturing their distinct temporal information. To ensure synergy between the two temporal representations, we incorporate cross-attention between the input features of the two SSMs, facilitating implicit information exchange. The resulting temporally enriched features are then integrated into the backbone via cross-attention, enhancing MDTrack’s ability to leverage temporal information. 
**Extensive experiments demonstrate the effectiveness of our proposed MDTrack. Both MDTrack-S (Modality-Specific Training) and MDTrack-U (Unified-Modality Training) achieve state-of-the-art performance on five multi-modal tracking benchmarks.**

Exploring Modality-Aware Fusion and Decoupled Temporal Propagation for Multi-Modal Object Tracking
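The MDTrack abstract describes a mixture-of-experts (MoE) gate that dynamically selects experts based on the input features. The sketch below shows a generic softmax gate with top-k expert selection. It is not the MDTrack architecture: the function names, the `gate_weights` matrix, the `top_k` parameter, and the toy scaling experts are hypothetical and serve only to illustrate the mechanism.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(
    x: Vector,
    gate_weights: List[Vector],
    experts: List[Callable[[Vector], Vector]],
    top_k: int = 2,
) -> Vector:
    """Route x to the top_k experts by gate probability and mix their outputs.

    gate_weights has one row per expert; the gate logit of an expert is the
    dot product of its row with x. Each expert is assumed to return a vector
    of the same length as x.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the selected experts
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        weight = probs[i] / norm
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Four toy experts that scale the input by 1, 2, 3, and 4.
toy_experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
toy_gate = [[0.0, 0.0], [0.0, 0.0], [10.0, 0.0], [10.0, 0.0]]
moe_out = moe_forward([1.0, 0.0], toy_gate, toy_experts, top_k=2)
```

In the toy call, the gate favors the last two experts equally, so the output is the even mix of their results. A modality-aware variant would presumably give each modality its own experts and let the gate weigh them per input.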

Compositional generalization has achieved substantial progress in computer vision on pre-collected training data. Nonetheless, real-world data continually emerges, with possible compositions being nearly infinite, long-tailed, and not entirely visible. Thus, an ideal model is supposed to gradually improve the capability of compositional generalization in an incremental manner. In this paper, we explore Composition-Incremental Learning for Compositional Generalization (CompIL) in the context of the compositional zero-shot learning (CZSL) task, where models need to continually learn new compositions, intending to improve their compositional generalization capability progressively. To quantitatively evaluate CompIL, we develop a benchmark construction pipeline leveraging existing datasets, yielding MIT-States-CompIL and C-GQA-CompIL. Furthermore, we propose a pseudo-replay framework utilizing a visual synthesizer to synthesize visual representations of learned compositions and a linguistic primitive distillation mechanism to maintain aligned primitive representations across the learning process. Extensive experiments demonstrate the effectiveness of the proposed framework.

Composition-Incremental Learning for Compositional Generalization

Categorical attributes with qualitative values are ubiquitous in cluster analysis of real datasets. Unlike numerical attributes, which admit the Euclidean distance, categorical attributes lack well-defined relationships among their possible values (also called categories), which hampers the exploration of compact categorical data clusters. Although most attempts focus on developing appropriate distance metrics, they typically assume a fixed topological relationship between categories when learning distance metrics, which limits their adaptability to varying cluster structures and often leads to suboptimal clustering performance. This paper, therefore, breaks the intrinsic relationship tie of attribute categories and learns customized distance metrics suitable for flexibly and accurately revealing various cluster distributions. As a result, the fitting ability of the clustering algorithm is significantly enhanced, benefiting from the learnable category relationships. Moreover, the learned category relationships are proven to be compatible with the Euclidean distance metric, enabling a seamless extension to mixed datasets that include both numerical and categorical attributes. Comparative experiments on 12 real benchmark datasets with significance tests show the superior clustering accuracy of the proposed method, with an average ranking of 1.25, which is significantly better than the 5.21 ranking of the current best-performing method.
