Singapore

Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Current robustness enhancement methods rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To support this methodology, we introduce a novel 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual effects, and pristine semantic reasoning. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMBench, MMStar, and RealWorldQA.
We will release our code, demo, and dataset soon.

AAAI 2026

Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

mllms

reasoning

robustness

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Recently, Multimodal Large Language Models (MLLMs) have achieved considerable advancements in vision-language tasks, yet they can produce potentially harmful or untrustworthy content. Despite substantial work investigating the trustworthiness of language models, MLLMs' capability to act honestly, especially when faced with visually unanswerable questions, remains largely underexplored. This work presents the first systematic assessment of honesty behaviors across various MLLMs. We ground honesty in models' response behaviors to unanswerable visual questions, define four representative types of such questions, and construct MoHoBench, a large-scale MLLM honesty benchmark consisting of 12k+ visual question samples, whose quality is guaranteed by multi-stage filtering and human verification. Using MoHoBench, we benchmarked the honesty of 28 popular MLLMs and conducted a comprehensive analysis. Our findings show that: (1) most models fail to appropriately refuse to answer when necessary, and (2) MLLMs' honesty is not solely a language modeling issue, but is deeply influenced by visual information, necessitating the development of dedicated methods for multimodal honesty alignment. Therefore, we implemented initial alignment methods using supervised and preference learning to improve honesty behavior, providing a foundation for future work on trustworthy MLLMs.

MoHoBench: Assessing Honesty of Multimodal Large Language Models via Unanswerable Visual Questions

Point-of-Interest (POI) recommendation plays a pivotal role in location-based services by guiding users to discover new and relevant places. While graph-based methods have shown promising results, effectively modeling the diversity and dynamics of user preferences remains a key challenge. Addressing this requires richer representations of both POIs and user interests, as well as more adaptive learning strategies.
In this work, we propose TMHKG, a Task-aware Meta-learning framework with a Heterogeneous Knowledge Graph for POI recommendation. To enhance representation learning, TMHKG constructs a dual-view POI knowledge graph that integrates geographical proximity and user-aware category transitions, and models users' evolving interests from sequential visit histories. On top of enriched features, TMHKG adopts a task-aware meta-learning paradigm, treating each user's recommendation task as a separate meta-task. A generalizable recommendation policy is first learned from diverse training tasks and then quickly adapted to each user's unique behavior, enabling highly personalized predictions.
Extensive experiments on two real-world datasets demonstrate that TMHKG consistently outperforms state-of-the-art baselines, highlighting its effectiveness in capturing complex user-POI interactions.

Task-Aware Meta-Learning on Heterogeneous Knowledge Graph for POI Recommendation

Pre-trained Vision-Language Models (VLMs), e.g., CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.

RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models

Efficiently and accurately determining the symmetry is a crucial step in the structural analysis of crystalline materials. Existing methods usually mindlessly apply deep learning models while ignoring the underlying chemical rules. More importantly, experiments show that they face a serious sub-property confusion (SPC) problem. To address the above challenges, from a decoupled perspective, we introduce the XRDecoupler framework, a problem-solving arsenal specifically designed to tackle the SPC problem. Imitating the thinking process of chemists, we innovatively incorporate multidimensional crystal symmetry information as superclass guidance to ensure that the model's prediction process aligns with chemical intuition. We further design a hierarchical PXRD pattern learning model and a multi-objective optimization approach to achieve high-quality representation and balanced optimization. Comprehensive evaluations on three mainstream databases (CCDC, CoREMOF, and InorganicData) demonstrate that XRDecoupler excels in performance, interpretability, and generalization. The code for our method is available in the supplement.

Rethinking Crystal Symmetry Prediction: A Decoupled Perspective

Object 6D pose estimation is a challenging task that is crucial for robotics and augmented reality applications, particularly when dealing with novel objects. A promising direction is single-reference-based estimation, which requires only a single annotated view instead of a full 3D model. However, existing methods rely on dense correspondence regression, which suffers from limited global consistency due to the local nature of convolutional architectures, and faces challenges in symmetric or occluded scenarios due to deterministic predictions.
We present CoordAR, a novel autoregressive framework for single-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a discretized coordinate map, which is decoded autoregressively in a probabilistic manner. To enable accurate correspondence regression, CoordAR introduces: 1) a novel coordinate map tokenization enabling probabilistic prediction over discretized 3D space; 2) a decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both pixel-aligned query features and the partially generated coordinate sequence.
Thanks to the novel designs, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests, while requiring only a single reference view.

CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation

Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios. We will release the source code upon paper acceptance.

Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

We consider the problem of modifying a description logic concept in light of models represented as pointed interpretations. We call this setting model change, and distinguish three main kinds of changes: eviction, which consists of only removing models; reception, which incorporates models; and revision, which combines removal with incorporation of models in a single operation. We introduce a formal notion of revision and argue that it does not reduce to a simple combination of eviction and reception, contrary to intuition. We provide positive and negative results on the compatibility of eviction and reception for EL-bottom and ALC description logic concepts and on the compatibility of revision for ALC concepts.

Model Change for Description Logic Concepts

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that while delivering an approximately two-fold increase in training throughput, its performance is on par with standard dense attention, even surpassing it on select key metrics, while consistently outperforming representative sparse attention methods including Longformer, Reformer, and BigBird across all evaluation metrics. 
Our work demonstrates that thoughtfully designed structural sparsity can serve as an effective inductive bias that simultaneously improves both computational efficiency and model performance, opening a new avenue for the architectural design of next-generation, high-performance LLMs.
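The band partitioning described in this abstract can be illustrated with a minimal sketch. It assumes causal attention and equal-width distance bands; the abstract does not state how the bands are balanced, so the split rule and the helper names `band_edges`, `head_mask`, and `spattention_masks` are hypothetical and are not taken from the authors' code.

```python
def band_edges(n, num_heads):
    """Split causal distances 0..n-1 into num_heads contiguous bands [lo, hi).

    Equal-width bands are an assumption made for illustration only.
    """
    return [(h * n // num_heads, (h + 1) * n // num_heads) for h in range(num_heads)]


def head_mask(n, lo, hi):
    """mask[i][j] is True when query i may attend to key j, i.e. lo <= i - j < hi."""
    return [[lo <= i - j < hi for j in range(n)] for i in range(n)]


def spattention_masks(n, num_heads):
    """One boolean mask per head; the bands do not overlap."""
    return [head_mask(n, lo, hi) for lo, hi in band_edges(n, num_heads)]


n, num_heads = 8, 4
masks = spattention_masks(n, num_heads)

# Number of (query, key) pairs each head scores.
pairs_per_head = [sum(map(sum, m)) for m in masks]

# Summed over heads, every causal pair is scored exactly once,
# instead of once per head as in dense multi-head attention.
total_pairs = sum(pairs_per_head)
print(pairs_per_head, total_pairs)
```

In this toy setting the heads together score n(n+1)/2 causal pairs once, whereas dense multi-head attention would score that many pairs in every head. Equal-width bands do not give each head the same number of pairs under a causal mask, so a scheme that is balanced in that sense would need a different split rule.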

Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Multilingual alignment is an effective and representative paradigm for enhancing LLMs' multilingual capabilities by transferring capabilities from high-resource languages to low-resource languages. Meanwhile, research on language-specific neurons provides a new perspective for analyzing and understanding LLMs' mechanisms. However, we find that many neurons are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types: language-specific neurons, language-related neurons, and language-agnostic neurons. We also propose a corresponding identification algorithm to distinguish these different types of neurons. Furthermore, based on the distributional characteristics of different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and multilingual capabilities of LLMs.
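One possible reading of the three neuron categories is sketched below, purely as an illustration. It assumes that the set of languages in which a neuron counts as active is already known, that "language-specific" means exactly one language, "language-related" means several but not all, and "language-agnostic" means all languages. The function `classify_neuron` and these thresholds are hypothetical; the paper's identification algorithm is not reproduced here.

```python
def classify_neuron(active_langs, all_langs):
    """Assign a neuron to one of three categories by counting its active languages.

    active_langs: languages in which the neuron is considered active (assumed given).
    all_langs: every language under study.
    Returns None for a neuron that is active in no language.
    """
    universe = set(all_langs)
    active = set(active_langs) & universe
    if not active:
        return None
    if active == universe:
        return "language-agnostic"
    if len(active) == 1:
        return "language-specific"
    return "language-related"


langs = ["en", "zh", "sw", "th"]
labels = {
    "n1": classify_neuron({"sw"}, langs),
    "n2": classify_neuron({"en", "zh"}, langs),
    "n3": classify_neuron(langs, langs),
}
print(labels)
```

Under these assumptions, a neuron shared by English and Chinese but not by the other two languages falls into the middle category, which a binary specific/agnostic split could not express.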

How Does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective

Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.
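The IGM condition mentioned in this abstract can be checked directly in a small matrix game. The sketch below is a stateless, two-agent illustration that assumes unique maxima; the payoff values and the helper names `greedy_joint_action`, `greedy_local_actions`, and `is_igm_consistent` are hypothetical and are not taken from the paper.

```python
def greedy_joint_action(q_tot):
    """Joint action that maximizes the global value (q_tot maps action tuples to floats)."""
    return max(q_tot, key=q_tot.get)


def greedy_local_actions(local_qs):
    """Each agent independently picks the argmax of its own local value list."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in local_qs)


def is_igm_consistent(q_tot, local_qs):
    """IGM holds when the local greedy actions together form the global greedy action."""
    return greedy_joint_action(q_tot) == greedy_local_actions(local_qs)


# Toy two-agent, two-action payoff table (values are illustrative).
q_tot = {(0, 0): 8.0, (0, 1): -12.0, (1, 0): -12.0, (1, 1): 0.0}

consistent_locals = [[1.0, 0.0], [1.0, 0.0]]    # both agents prefer action 0
inconsistent_locals = [[0.0, 1.0], [0.0, 1.0]]  # both agents prefer action 1

ok = is_igm_consistent(q_tot, consistent_locals)
bad = is_igm_consistent(q_tot, inconsistent_locals)
print(ok, bad)
```

The second set of local values selects the joint action (1, 1), which is not the maximizer of the global table, so decentralized greedy execution would miss the optimal joint action even if the factorized values fit the table well elsewhere.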

