Large Reasoning Language Models (LRMs) have recently shown remarkable performance on complex reasoning tasks, but their extensive reasoning chains incur substantial computational overhead. To address this challenge, we propose Outlier-aware Reasoning Conciseness Adaptive Merge (ORCA), a novel plug-and-play model merging framework that leverages outlier activation patterns to fuse base models with reasoning models. ORCA introduces three key innovations: (1) adaptive alignment that reduces conflicts between disparate activation patterns during merging, (2) outlier-guided allocation that assigns merging coefficients proportional to each layer's reasoning importance, as indicated by its outlier concentration, and (3) dynamic probe-based adjustment that adapts merging coefficients during inference based on input-specific activation characteristics. These strategies integrate seamlessly into existing merging pipelines while producing unified models that maintain reasoning accuracy with significantly reduced response verbosity. Comprehensive evaluation across six benchmarks using Qwen and LLaMA models shows that ORCA reduces average response length by 55\% while improving accuracy by 2.4$\sim$5.7\% over existing methods. Code is provided in the supplementary material.
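To make the outlier-guided allocation idea concrete, here is a minimal illustrative sketch, not the paper's actual implementation. It assumes a simple outlier heuristic (fraction of activations beyond k standard deviations) and a linear mapping from each layer's outlier score to a merging coefficient between a base and a reasoning model; all function names and parameter choices are hypothetical.

```python
import numpy as np

def outlier_concentration(acts, k=3.0):
    """Fraction of activation entries more than k standard deviations
    from the mean -- a simple stand-in for a layer's outlier concentration."""
    mu, sigma = acts.mean(), acts.std()
    return float(np.mean(np.abs(acts - mu) > k * sigma))

def outlier_guided_merge(base_layers, reasoning_layers, layer_acts,
                         lo=0.2, hi=0.8, k=3.0):
    """Interpolate per-layer weights, giving layers with higher outlier
    concentration a larger coefficient toward the reasoning model.

    base_layers / reasoning_layers: lists of weight arrays, one per layer.
    layer_acts: list of calibration activation samples, one array per layer.
    Returns the merged weight list and the per-layer coefficients.
    """
    scores = np.array([outlier_concentration(a, k) for a in layer_acts])
    # Normalize scores to [0, 1]; fall back to 0.5 if all layers tie.
    rng = scores.max() - scores.min()
    norm = (scores - scores.min()) / rng if rng > 0 else np.full_like(scores, 0.5)
    coeffs = lo + (hi - lo) * norm  # map into [lo, hi]
    merged = [(1 - c) * b + c * r
              for c, b, r in zip(coeffs, base_layers, reasoning_layers)]
    return merged, coeffs
```

Under this sketch, a layer whose calibration activations contain a heavier tail of outliers receives a coefficient closer to `hi`, so its merged weights lean toward the reasoning model, while low-outlier layers stay closer to the base model.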
