Singapore

Pre-trained Vision-Language Models (VLMs), e.g. CLIP, have become essential tools in multimodal transfer learning. However, fine-tuning VLMs in few-shot scenarios poses significant challenges in balancing task-specific adaptation and generalization in the obtained model. Meanwhile, current research has predominantly focused on prompt-based adaptation methods, leaving adapter-based approaches underexplored and revealing notable performance gaps. To address these challenges, we introduce a novel Reconstruction-based Multimodal Adapter (RMAdapter), which leverages a dual-branch architecture. Unlike conventional single-branch adapters, RMAdapter consists of: (1) an adaptation branch that injects task-specific knowledge through parameter-efficient fine-tuning, and (2) a reconstruction branch that preserves general knowledge by reconstructing latent space features back into the original feature space. This design facilitates a dynamic balance between general and task-specific knowledge. Importantly, although RMAdapter introduces an additional reconstruction branch, it is carefully optimized to remain lightweight. By computing reconstruction loss locally at each layer and sharing projection modules, the overall computational overhead is kept minimal. A consistency constraint is also incorporated to better regulate the trade-off between discriminability and generalization. We comprehensively evaluate the effectiveness of RMAdapter on three representative tasks: generalization to new categories, generalization to new target datasets, and domain generalization. Without relying on data augmentation or duplicate prompt designs, our RMAdapter consistently outperforms state-of-the-art approaches across all evaluation metrics.
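
The dual-branch idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the class name `DualBranchAdapter`, the choice of a shared down-projection, the ReLU, the zero-initialised adaptation weights, and the mean-squared reconstruction loss are all assumptions made for illustration, and the consistency constraint is omitted.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class DualBranchAdapter:
    """Toy adapter layer: one shared down-projection feeds two up-projections."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

        self.w_down = init(dim, bottleneck)                       # shared projection (assumed)
        self.w_adapt = [[0.0] * dim for _ in range(bottleneck)]   # adaptation branch, zero-init
        self.w_recon = init(bottleneck, dim)                      # reconstruction branch

    def forward(self, x):
        # Latent features from the shared projection, with a ReLU non-linearity.
        z = [[max(v, 0.0) for v in row] for row in matmul(x, self.w_down)]
        # Adaptation branch: residual, task-specific update of the input features.
        delta = matmul(z, self.w_adapt)
        adapted = [[xi + di for xi, di in zip(xr, dr)] for xr, dr in zip(x, delta)]
        # Reconstruction branch: map latent features back to the original space
        # and score the result locally, at this layer only.
        recon = matmul(z, self.w_recon)
        sq_err = [(ri - xi) ** 2 for rr, xr in zip(recon, x) for ri, xi in zip(rr, xr)]
        return adapted, sum(sq_err) / len(sq_err)
```

With zero-initialised adaptation weights the layer starts as an identity mapping, so training begins from the pre-trained features, while the per-layer reconstruction loss can be added to the task loss without a second forward pass.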

AAAI 2026

RMAdapter: Reconstruction-based Multi-Modal Adapter for Vision-Language Models

ml: multimodal learning

ml: transfer

cv: language and vision

domain adaptation

multi-task learning

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Efficiently and accurately determining symmetry is a crucial step in the structural analysis of crystalline materials. Existing methods typically apply deep learning models without regard for the underlying chemical rules. More importantly, experiments show that they face a serious sub-property confusion (SPC) problem. To address these challenges, we introduce the XRDecoupler framework, which approaches the SPC problem from a decoupled perspective. Imitating the thinking process of chemists, we incorporate multidimensional crystal symmetry information as superclass guidance to ensure that the model's prediction process aligns with chemical intuition. We further design a hierarchical PXRD pattern learning model and a multi-objective optimization approach to achieve high-quality representation and balanced optimization. Comprehensive evaluations on three mainstream databases (CCDC, CoREMOF, and InorganicData) demonstrate that XRDecoupler excels in performance, interpretability, and generalization. The code for our method is available in the Supplement.

Rethinking Crystal Symmetry Prediction: A Decoupled Perspective
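
As a rough illustration of what a symmetry superclass can look like, the sketch below maps a space group number to its crystal system using the standard numbering of the 230 space groups. Treating the crystal system as the superclass is an assumption made here; the abstract does not say which symmetry information XRDecoupler actually uses, and the function name `crystal_system` is hypothetical.

```python
from bisect import bisect_right

# First space group number of each crystal system (standard numbering, 1 to 230).
_STARTS = [1, 3, 16, 75, 143, 168, 195]
_SYSTEMS = [
    "triclinic",
    "monoclinic",
    "orthorhombic",
    "tetragonal",
    "trigonal",
    "hexagonal",
    "cubic",
]


def crystal_system(space_group: int) -> str:
    """Return the crystal system (a coarse superclass) of a space group number."""
    if not 1 <= space_group <= 230:
        raise ValueError("space group number must be in 1..230")
    return _SYSTEMS[bisect_right(_STARTS, space_group) - 1]
```

A classifier could first predict this seven-way label and then restrict or reweight its space group prediction accordingly, which is one plausible reading of superclass guidance.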

Object 6D pose estimation is a challenging task that is crucial for robotics and augmented reality applications, particularly when dealing with novel objects. A promising direction is single-reference-based estimation, which requires only a single annotated view instead of a full 3D model. However, existing methods rely on dense correspondence regression, which suffers from limited global consistency due to the local nature of convolutional architectures, and faces challenges in symmetric or occluded scenarios due to deterministic predictions.
We present CoordAR, a novel autoregressive framework for single-reference 6D pose estimation of unseen objects. CoordAR formulates 3D-3D correspondences between the reference and query views as a discretized coordinate map, which is decoded autoregressively in a probabilistic manner. To enable accurate correspondence regression, CoordAR introduces: 1) a novel coordinate map tokenization enabling probabilistic prediction over discretized 3D space; 2) a decoupled encoding strategy that separately encodes RGB appearance and coordinate cues; and 3) an autoregressive transformer decoder conditioned on both pixel-aligned query features and the partially generated coordinate sequence.
Thanks to the novel designs, CoordAR significantly outperforms existing methods on multiple benchmarks and demonstrates strong robustness to symmetry, occlusion, and other challenges in real-world tests, while requiring only a single reference view.

CoordAR: One-Reference 6D Pose Estimation of Novel Objects via Autoregressive Coordinate Map Generation
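
To make the idea of a discretized coordinate map concrete, here is a minimal sketch of turning one 3D coordinate into discrete tokens and back. It assumes uniform per-axis binning over a normalised range; the abstract does not describe CoordAR's actual tokenizer, so `tokenize`, `detokenize`, the range `[-1, 1]`, and the 256 bins are illustrative choices only.

```python
def tokenize(coord, lo=-1.0, hi=1.0, num_bins=256):
    """Quantise each axis of a 3D coordinate into an integer bin index."""
    tokens = []
    for value in coord:
        clipped = min(max(value, lo), hi)
        index = int((clipped - lo) / (hi - lo) * num_bins)
        tokens.append(min(index, num_bins - 1))  # keep the upper edge in the last bin
    return tuple(tokens)


def detokenize(tokens, lo=-1.0, hi=1.0, num_bins=256):
    """Map bin indices back to the centre of each bin."""
    width = (hi - lo) / num_bins
    return tuple(lo + (t + 0.5) * width for t in tokens)
```

Once coordinates are tokens, a decoder can output a probability distribution over bins instead of a single regressed value, which is what lets a model represent several plausible correspondences for symmetric or occluded objects.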

Modular design of planning-oriented autonomous driving has markedly advanced end-to-end systems. However, existing architectures remain constrained by an over-reliance on ego status, hindering generalization and robust scene understanding. We identify the root cause as an inherent design within these architectures that allows ego status to be easily leveraged as a shortcut. Specifically, the premature fusion of ego status in the upstream BEV encoder allows an information flow from this strong prior to dominate the downstream planning module. To address this challenge, we propose AdaptiveAD, an architectural-level solution based on a multi-context fusion strategy. Its core is a dual-branch structure that explicitly decouples scene perception and ego status. One branch performs scene-driven reasoning based on multi-task learning, but with ego status deliberately omitted from the BEV encoder, while the other conducts ego-driven reasoning based solely on the planning task. A scene-aware fusion module then adaptively integrates the complementary decisions from the two branches to form the final planning trajectory. To ensure this decoupling does not compromise multi-task learning, we introduce a path attention mechanism for ego-BEV interaction and add two targeted auxiliary tasks: BEV unidirectional distillation and autoregressive online mapping. Extensive evaluations on the nuScenes dataset demonstrate that AdaptiveAD achieves state-of-the-art open-loop planning performance. Crucially, it significantly mitigates the over-reliance on ego status and exhibits impressive generalization capabilities across diverse scenarios. We will release the source code upon paper acceptance.

Decoupling Scene Perception and Ego Status: A Multi-Context Fusion Approach for Enhanced Generalization in End-to-End Autonomous Driving

We consider the problem of modifying a description logic concept in light of models represented as pointed interpretations. We call this setting model change, and distinguish three main kinds of changes: eviction, which consists of only removing models; reception, which incorporates models; and revision, which combines removal with incorporation of models in a single operation. We introduce a formal notion of revision and argue that it does not reduce to a simple combination of eviction and reception, contrary to intuition. We provide positive and negative results on the compatibility of eviction and reception for EL-bottom and ALC description logic concepts and on the compatibility of revision for ALC concepts.

Model Change for Description Logic Concepts

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that while delivering an approximately two-fold increase in training throughput, its performance is on par with standard dense attention, even surpassing it on select key metrics, while consistently outperforming representative sparse attention methods including Longformer, Reformer, and BigBird across all evaluation metrics. 
Our work demonstrates that thoughtfully designed structural sparsity can serve as an effective inductive bias that simultaneously improves both computational efficiency and model performance, opening a new avenue for the architectural design of next-generation, high-performance LLMs.

Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
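
The partitioning of attention into distance bands can be sketched as follows. This is a simplified illustration under stated assumptions: it uses causal attention and equal-width bands over the query-key distance, whereas the abstract only says the bands are balanced and non-overlapping without giving the scheme. The function names are hypothetical.

```python
def head_bands(seq_len, num_heads):
    """Split causal distances 0..seq_len-1 into contiguous, non-overlapping bands."""
    bounds = [h * seq_len // num_heads for h in range(num_heads + 1)]
    return [(bounds[h], bounds[h + 1]) for h in range(num_heads)]


def head_mask(seq_len, band):
    """mask[i][j] is True when query i may attend to key j for this head."""
    lo, hi = band
    return [[lo <= i - j < hi for j in range(seq_len)] for i in range(seq_len)]


def total_pairs(seq_len, num_heads):
    """Count query-key pairs computed across all heads under the banded scheme."""
    bands = head_bands(seq_len, num_heads)
    return sum(sum(map(sum, head_mask(seq_len, band))) for band in bands)
```

For `seq_len=8` and `num_heads=4`, the banded scheme evaluates 36 query-key pairs in total, the same as one dense causal head, whereas four dense causal heads would evaluate 144. Note that equal-width bands do not give each head the same number of pairs, since short distances occur more often than long ones.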

Multilingual alignment is an effective and representative paradigm for enhancing LLMs' multilingual capabilities by transferring capabilities from high-resource languages to low-resource languages. Meanwhile, research on language-specific neurons provides a new perspective for analyzing and understanding LLMs' mechanisms. However, we find that many neurons are shared by multiple but not all languages and cannot be correctly classified. In this work, we propose a ternary classification methodology that categorizes neurons into three types: language-specific neurons, language-related neurons, and language-agnostic neurons. We also propose a corresponding identification algorithm to distinguish these types of neurons. Furthermore, based on the distributional characteristics of the different types of neurons, we divide the LLMs' internal process for multilingual inference into four parts: (1) multilingual understanding, (2) shared semantic space reasoning, (3) multilingual output space transformation, and (4) vocabulary space outputting. Additionally, we systematically analyze the models before and after alignment with a focus on different types of neurons. We also analyze the phenomenon of "Spontaneous Multilingual Alignment". Overall, our work conducts a comprehensive investigation based on different types of neurons, providing empirical results and valuable insights to better understand multilingual alignment and the multilingual capabilities of LLMs.

How Does Alignment Enhance LLMs’ Multilingual Capabilities? A Language Neurons Perspective
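
The three neuron categories can be illustrated with a simple rule over the set of languages for which a neuron is found to be active. The rule below is an assumption for illustration (one language, several but not all, or all languages); the abstract does not describe the paper's identification algorithm, and the extra `inactive` label and the function name are hypothetical.

```python
def classify_neuron(active_languages, all_languages):
    """Label a neuron by how many of the studied languages activate it."""
    studied = set(all_languages)
    n_active = len(set(active_languages) & studied)
    if n_active == 0:
        return "inactive"
    if n_active == 1:
        return "language-specific"
    if n_active < len(studied):
        return "language-related"   # shared by multiple but not all languages
    return "language-agnostic"
```

For example, with the studied languages `["en", "zh", "de", "sw"]`, a neuron active for `["en", "de"]` falls into the language-related category, which a binary specific-versus-agnostic scheme could not express.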

Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.

Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning
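
The IGM condition referred to above says that the joint action obtained from each agent's own greedy choice must equal the greedy joint action of the global value function. The sketch below checks this on a small two-agent matrix game; the payoff numbers and local values are made up for illustration, and unique maximisers are assumed.

```python
def greedy_joint(q_joint):
    """Return the joint action that maximises a table {(a1, a2, ...): value}."""
    return max(q_joint, key=q_joint.get)


def is_igm_consistent(q_joint, q_locals):
    """True when per-agent greedy actions reproduce the greedy joint action."""
    local_greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in q_locals)
    return local_greedy == greedy_joint(q_joint)


# A non-monotonic two-agent game: miscoordination is heavily penalised.
payoff = {(0, 0): 8.0, (0, 1): -12.0, (1, 0): -12.0, (1, 1): 0.0}

consistent_locals = [[1.0, 0.2], [0.9, 0.1]]    # both agents prefer action 0
inconsistent_locals = [[0.1, 0.5], [0.0, 0.3]]  # both agents prefer action 1
```

Here `is_igm_consistent(payoff, consistent_locals)` holds, while the second set of local values selects the suboptimal joint action `(1, 1)` and so violates IGM.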

As AI becomes increasingly powerful and ubiquitous, it is disrupting skills and displacing workers. NSF’s National AI Institute for Adult Learning and Online Education (AI-ALOE) posits that AI can be part of the solution to the growing problem if we can use AI for reskilling, upskilling, and workforce development at scale. The long-term vision of AI-ALOE is to develop and use AI technologies to enhance the proficiency of online education for all adult learners, using in-person education as a benchmark. The day-to-day mission of AI-ALOE is to conduct responsible research into AI that is grounded in theories of human cognition and learning and derived from the scientific process of learning engineering. I will describe ongoing research at AI-ALOE.

AI for Reskilling, Upskilling, and Workforce Development

Small Data: A New Paradigm for the Next Generation of AI

Current algorithms for steering LLM behavior are often implemented for specific use cases and tasks. To help provide a more general purpose approach to steering model behavior, IBM has recently developed two toolkits: AI Steerability 360 (AISteer360) and In-Context Explainability 360 (ICX360). This hands-on lab will provide a comprehensive walkthrough of these toolkits.

Participants will first be guided through a conceptual overview of how to steer model behavior across four model control surfaces: input, structural, state, and output steering. Through a series of interactive coding sessions, attendees will implement steering methods on a running example focused on steering a model to produce less toxic outputs. The lab will demonstrate how to construct steering controls for fine-grained model intervention, use cases for specific model tasks, and benchmarks for comparing steering methods on a given use case.

Closing the loop, participants will learn how the ICX360 toolkit can be used to understand why a given (steered) model produced a given output, and how to use these insights to guide refined controls. The session will build progressively from concept to implementation, ensuring participants understand how to create an end-to-end steering workflow.

Prerequisites: Participants should be comfortable with Python programming and have some prior exposure to LLMs and natural language processing (prompting, fine-tuning, etc.). Experience using the Hugging Face transformers package will be beneficial but not required. Participants must bring their own laptop with a Python IDE installed.

Remote compute access will be facilitated through Hugging Face and Google Colab. All setup instructions and access details will be communicated to enrolled participants in advance via the lab’s website.
