Multimodal Large Language Models (MLLMs) have shown strong performance on vision-language tasks. However, existing multimodal reasoning models often produce excessive reasoning steps, leading to high computational cost and inefficiency. In this paper, we propose the Multimodal Adaptive Reasoning Model (MARS), which adaptively adjusts its reasoning strategy to question difficulty. Specifically, MARS adopts a three-stage training framework built on our constructed training dataset (MART): 1) CoT Masking Learning, which strengthens logical coherence by predicting masked reasoning steps; 2) Adaptive Reasoning Instruction Learning, which trains the model to skip or keep reasoning steps according to difficulty level; and 3) CoT Lightweight Reinforcement Learning, a GRPO algorithm grounded in the Information Bottleneck principle that shortens the CoT while preserving performance and generalizability. Results on both in-domain and out-of-domain datasets show that MARS substantially reduces CoT length (a 90.2% decrease) while improving accuracy by 0.54%, outperforming existing SOTA open-source and proprietary MLLMs.
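To make the third stage concrete, the sketch below shows how a GRPO-style group-relative advantage could be combined with a chain-of-thought length penalty. This is a minimal illustration only: the `length_coef` penalty and its value are hypothetical and do not reflect the paper's actual Information Bottleneck-based reward design.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, cot_lengths, length_coef=0.001):
    # Shape each sampled response's reward with a penalty proportional
    # to its CoT token length (hypothetical coefficient, for illustration).
    shaped = [r - length_coef * n for r, n in zip(rewards, cot_lengths)]
    mu, sigma = mean(shaped), pstdev(shaped)
    # GRPO-style normalization: standardize shaped rewards within the
    # group of responses sampled for the same question.
    return [(s - mu) / (sigma + 1e-8) for s in shaped]

# Four sampled responses: two correct (reward 1.0), two wrong (reward 0.0),
# with varying CoT lengths. The short correct answer earns the top advantage.
advs = grpo_advantages([1.0, 1.0, 0.0, 0.0], [120, 600, 80, 400])
```

Under this shaping, a correct answer with a shorter CoT is preferred over an equally correct but longer one, which is one simple way a policy can be pushed toward lightweight reasoning.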