Singapore

Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., up to 1.5B parameters) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space-a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) mapping natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs&#39; performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

AAAI 2026

Decoupling Understanding from Reasoning via Problem Space Mapping for Small-Scale Model Reasoning

problem space mapping

small-scale

llm reasoning

reinforcement learning

Despite recent advances in the reasoning capabilities of Large Language Models (LLMs), improving the reasoning ability of Small Language Models (SLMs, e.g., up to 1.5B parameters) remains challenging. A key obstacle lies in the complexity and variability of natural language: essentially equivalent problems often appear in diverse surface forms, often obscured by redundant or distracting details. This imposes a dual burden on SLMs: they must first extract the core problem from complex linguistic input, and then perform reasoning based on that understanding. The resulting vast and noisy problem space hinders optimization, particularly for models with limited capacity. To address this, we propose a new framework that decouples understanding from reasoning by mapping natural language problems into a canonical problem space-a semantically simplified yet expressive domain. This enables SLMs to focus on reasoning over standardized inputs, free from linguistic variability. Within this framework, we introduce DURIT (Decoupled Understanding from Reasoning via Iterative Training), a three-step algorithm that iteratively: (1) mapping natural language problems via reinforcement learning, (2) aligns reasoning trajectories through self-distillation, and (3) trains reasoning policies in the problem space. The mapper and reasoner are co-trained in an alternating loop throughout this process. Experiments show that DURIT substantially improves SLMs' performance on both in-domain and out-of-domain mathematical and logical reasoning tasks. Beyond improving reasoning capabilities, DURIT also improves the robustness of reasoning, validating decoupling understanding from reasoning as an effective strategy for strengthening SLMs.

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Generative reasoning with large language models (LLMs) often involves long decoding sequences, leading to substantial memory and latency overheads from accumulating key-value (KV) caches. While existing KV compression methods primarily focus on reducing prefill memory from long input sequences, they fall short in addressing the dynamic and layer-sensitive nature of long-form generation, which is central to reasoning tasks. We propose Lethe, a dynamic KV cache management framework that introduces adaptivity along both the spatial and temporal dimensions of decoding. Along the spatial dimension, Lethe performs layerwise sparsity-aware allocation, assigning token pruning budgets to each transformer layer based on estimated attention redundancy. Along the temporal dimension, Lethe conducts multi-round token pruning during generation, driven by a Recency-Aware Selective Retention (RASR) mechanism. RASR extends traditional recency-based heuristics by also considering token relevance derived from evolving attention patterns, enabling informed decisions about which tokens to retain or evict. Empirical results demonstrate that Lethe achieves a favorable balance between efficiency and generation quality across diverse models and tasks, increases throughput by up to 2.56×.

Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Change captioning aims to describe changes between a pair of images. However, existing works rely on visual features alone, which often fail to capture subtle but meaningful changes because they lack the ability to represent explicitly structured information such as object relationships and compositional semantics. To alleviate this, we present CORTEX (COmpositional Reasoning-aware TEXt-guided), a novel framework that integrates complementary textual cues to enhance change understanding. In addition to capturing cues from pixel-level differences, CORTEX utilizes scene-level textual knowledge provided by Vision Language Models (VLMs) to extract richer image text signals that reveal underlying compositional reasoning. CORTEX consists of three key modules: (i) an Image-level Change Detector that identifies low-level visual differences between paired images, (ii) a Reasoning-aware Text Extraction (RTE) module that use VLMs to generate compositional reasoning descriptions implicit in visual features, and (iii) an Image-Text Dual Alignment (ITDA) module that aligns visual and textual features for fine-grained relational reasoning. This enables CORTEX to reason over visual and textual features and capture changes that are otherwise ambiguous in visual features alone.

Leveraging Textual Compositional Reasoning for Robust Change Captioning

The explosive growth of artificial intelligence is exponentially escalating computational demand, inflating data-centre energy use and carbon emissions, and spurring rapid deployment of green data centres to relieve resource and environmental stress. Achieving sub-minute orchestration of renewables, smart grids, storage, and thermo-fluid loads, while minimising PUE and lifecycle carbon intensity, hinges on accurate load forecasting. Existing rule-based or thermodynamic methods are subjective and costly. Although typical deep time-series networks perform well on large-scale datasets, fail in cold-start, fragmented, or drifting environments typical of nascent green data centres. We introduce HyperLoad, a cross-modal framework that exploits large-scale pre-trained language models (LLMs) to overcome data scarcity. In the alignment stage, textual priors and time-series attributes are mapped to a common latent space, maximizing prior-knowledge utility. In the modeling stage, domain-aligned priors are injected through adaptive prefix tuning, enabling rapid scenario adaptation, while a global interactive self-attention module captures cross-device temporal dependencies. The public GreenData dataset is released for benchmarking. Under both data-sufficient and data-scarce regimes, HyperLoad consistently surpasses state-of-the-art baselines, demonstrating its practicality for sustainable data-centre management.

HyperLoad: A Cross-Modality Enhanced Large Language Model-Based Framework for Green Data Center Cooling Load Prediction

Text-to-visible \& infrared person retrieval aims to retrieve the corresponding visible (RGB) and thermal infrared (TIR) images given the text descriptions. Existing methods perform semantic decoupling by aligning RGB and TIR features separately to different attributes, thereby facilitating the alignment between the fused multimodal representation and the text. However, insufficient TIR representation ability and cross-view representation capabilities of RGB and TIR modalities limit the retrieval accuracy and robustness. To address these issues, we propose a novel Dual-teacher Interactive Knowledge Distillation Network called DIKDNet, which performs the interactive knowledge distillation between two modality-specific teachers with rich cross-view representation capabilities to enhance TIR representations and the collaborative knowledge distillation from both teachers to the corresponding students to enhance the cross-modal cross-view representations, for robust text-to-visible \& infrared person retrieval. Specifically, to enhance the representation ability of the TIR backbone network while preserving modality-specific characteristics, we design an Interactive Knowledge Distillation Module (IKDM), which introduces a boundary-constrained distillation strategy between RGB and TIR backbones, to transfer the semantic features of RGB backbone to TIR one. To enhance the cross-modal cross-view representation capability, we design a Collaborative Knowledge Distillation Module (CKDM) to transfer the cross-modal similarity relations and the cross-view multimodal representations from teacher networks to student ones. Experimental results demonstrate that our method consistently achieves significant performance gains on both the RGBT-PEDES and RGBNT201-PEDES datasets. The code will be released upon the acceptance.

Dual-Teacher Interactive Knowledge Distillation Network for Text-to-Visible & Infrared Person Retrieval

Current 3D visual grounding tasks only process sentence-level detection or segmentation, which critically fails to leverage the rich compositional contextual reasonings within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase to 3D instance mapping, aiming at enhancing fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 55,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm where each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.

3D-DRES: Detailed 3D Referring Expression Segmentation

Multimodal Large Language Models (MLLMs) have achieved remarkable progress in multimodal reasoning. However, they often excessively rely on textual information during the later stages of inference, neglecting the crucial integration of visual input. Current methods typically address this by explicitly injecting visual information to guide the reasoning process. In this work, through an analysis of MLLM attention patterns, we made an intriguing observation: with appropriate guidance, MLLMs can spontaneously re-focus their attention on visual inputs during the later stages of reasoning, even without explicit visual injection. This spontaneous shift in focus suggests that MLLMs are intrinsically capable of performing visual fusion reasoning. Building on this insight, we introduce **Look-Back**, an implicit approach designed to guide MLLMs to "look back" at visual information in a self-directed manner during reasoning. Look-Back empowers the model to autonomously determine when, where, and how to re-focus on visual inputs, eliminating the need for explicit model-structure constraints or additional input. We demonstrate that Look-Back significantly enhances the model's reasoning and perception capabilities, as evidenced by extensive empirical evaluations on multiple multimodal benchmarks.

Look-Back: Implicit Visual Re-focusing in MLLM Reasoning

Stereo matching recovers 3D scene information based on the correlation between corresponding pixels. Despite impressive progress, existing methods lack sufficient correlation priors in ill-posed regions such as occlusions, detailed and reflective regions. In this paper, we propose Geometry Aware Stereo Matching Network (GEAStereo) to enhance geometric structure perception and address this issue. We adaptively incorporate the Monocular Disparity Distribution Prior into the stereo cost volume, building Mono-Stereo Fusion Volume (MSFV), which effectively captures global geometric structures and rectifies the correlation information in ill-posed regions. Furthermore, we introduce rich detail information from gradient features and construct a Detail-Aware Volume (DAV) by aggregating the group-wise cost volume under the guidance of gradient spatial attention, thus enhancing the correlation modeling in detailed structures. Jointly, MSFV and DAV provide rich correlation priors for disparity iterative optimization. Experimental results show that our method achieves competitive results on the ETH3D and KITTI2015 benchmarks. Compared with the state-of-the-art methods, our method demonstrates stronger performance in zero-shot generalization. The code is available in Supplementary Material.

Geometry-Aware Stereo Matching via Monocular Disparity Distribution Prior and Gradient Enhancement

Visual anomaly detection is limited by the lack of sufficient anomaly data. While existing anomaly synthesis methods have made remarkable progress, achieving both realism and diversity in synthesis remains a major obstacle. To address this, we propose AnomalyPainter, a novel framework that breaks the diversity-realism trade-off dilemma through synergizing Vision Language Large Model (VLLM), Latent Diffusion Model (LDM), and our newly introduced texture library Tex-9K. Tex-9K is a professional texture library containing 75 categories and 8792 texture assets crafted for diverse anomaly synthesis. Leveraging VLLM's general knowledge, reasonable anomaly text descriptions are generated for each industrial object and matched with relevant diverse textures from Tex-9K. These textures then guide the LDM via ControlNet to paint on normal images. Furthermore, we introduce Texture-Aware Latent Init to stabilize the natural-image-trained ControlNet for industrial images. Extensive experiments show that AnomalyPainter outperforms existing methods in realism, diversity, and generalization, achieving superior downstream performance.

AnomalyPainter: Vision-Language-Diffusion Synergy for Realistic and Diverse Unseen Industrial Anomaly Synthesis

Video language models (VideoLMs) have made significant progress in multimodal understanding. However, temporal understanding, which involves identifying event order, duration, and relationships across time, still remains a core challenge. Prior works emphasize positional encodings (PEs) as a key mechanism for encoding temporal structure. Surprisingly, we find that removing or modifying PEs in video inputs yields minimal degradation in the performance of temporal understanding. In contrast, reversing the frame sequence while preserving the original PEs causes a substantial drop. To explain this behavior, we conduct substantial analysis experiments to trace how temporal information is integrated within the model. We uncover a causal information pathway: temporal cues are progressively synthesized through inter-frame attention, aggregated in the final frame, and subsequently integrated into the query tokens. This emergent mechanism shows that temporal reasoning emerges from inter visual token interactions under the constraints of causal attention, which implicitly encodes temporal structure. Based on these insights, we propose two efficiency-oriented strategies: staged cross-modal attention and a temporal exit mechanism for early token truncation. Experiments on two benchmarks validate the effectiveness of both approaches.

Causality Matters: How Temporal Information Emerges in Video Language Models

Infrared and Visible Image Fusion (IVIF) integrates complementary information from distinct modalities to enhance image quality. However, the effectiveness declines under unseen conditions such as novel weather or scenes, due to domain shifts primarily from variations of data distribution in the visible modality, while the infrared modality remains relatively stable. To overcome domain shifts caused by the imbalance between modalities during image fusion, we propose a Domain Adaptation Guided Infrared and Visible Image Fusion method, termed DAFusion, leveraging a dual-rank domain adapter to enable fast adaptation to diverse adverse conditions during image fusion. Specifically, trainable low-rank and high-rank embedding spaces are respectively used to capture knowledge common across domains (domain-shared) and those unique to target domains (domain-specific). To leverage the dual-rank adapter more effectively, we develop a homeostatic knowledge allotment strategy to integrate the distinct types of knowledge dynamically based on the uncertainty value of target domains. Since domain adaptation typically optimizes for feature alignment across domains and emphasizes invariance rather than preserving specific cues critical for image fusion, while the fusion objective requires retaining discriminative and complementary features, a conflict between the two modules appears. To reconcile this, we further adopt a bi-level optimization framework that structurally decouples the two objectives, enabling the fusion module to steer the adaptation process while benefiting in return from domain-aligned representations. Experimental results on three benchmarks demonstrate that our method significantly outperforms state-of-the-art approaches, achieving both an enhancement in fusion quality and an improvement on subsequent high-level tasks.

Downloads

Next from AAAI 2026

Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES