Singapore

In this paper, we investigate code-integrated reasoning (CIR), where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL). Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach ETIR (Effective TIR) to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism of code-integrated reasoning, revealing several key insights, such as the extension of model’s capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. These findings underscore the potential of code-integrated reasoning as a scalable paradigm for advancing robust and efficient language model reasoning.

AAAI 2026

Towards Effective Code-Integrated Reasoning

nlp: (large) language models

nlp: applications

ml: reinforcement learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Efficiently fine-tuning pre-trained models for downstream tasks is a key challenge in the era of foundation models. Parameter-efficient fine-tuning (PEFT) presents a promising solution, achieving performance comparable to full fine-tuning by updating only a small number of adaptation weights per layer. Traditional PEFT methods typically rely on a single expert, where the adaptation weight is a low-rank matrix. However, for complex tasks, the data's inherent diversity poses a significant challenge for such models, as a single adaptation weight cannot adequately capture the features of all samples. To address this limitation, we explore how to integrate multiple **small** adaptation experts into a compact structure to defeat a **large** adapter. Specifically, we propose Tucker Adaptation (TuckA), a method with four key properties: (i) We use Tucker decomposition to create a compact 3D tensor where each slice naturally serves as an expert. The low-rank nature of this decomposition ensures that the number of parameters scales efficiently as more experts are added. (ii) We introduce a hierarchical strategy that organizes these experts into groups at different granularities, allowing the model to capture both local and global data patterns. (iii) We develop an efficient batch-level routing mechanism, which reduces the router's parameter size by a factor of $L$ compared to routing at every adapted layer (where $L$ is the number of adapted layers) (iv) We propose data-aware initialization to achieve loss-free expert load balancing based on theoretical analysis. Extensive experiments on benchmarks in natural language understanding, image classification, and mathematical reasoning speak to the efficacy of TuckA, offering a new and effective solution to the PEFT problem.

TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning

Clinical reinforcement learning (RL) holds promise for treatment recommendation but remains hindered by black-box decision processes, limited safety guarantees, and lack of individualized reasoning. We introduce Delphi Engine, the first fully trainable neuro-symbolic causal RL framework for dynamic treatment planning, designed to answer three core clinical questions in real time: Why this action? Why is it safe? Why for this patient?
Specifically, Delphi integrates: (1) causality-aware state modeling using discretized physiological variables and subtype-specific causal graphs; (2) adaptive symbolic rule constraints, combining clinical guidelines and behavior-derived rules into soft differentiable logic; and (3) interpretable decision fusion, where actions are selected based on joint neural-symbolic Q-values and explained via structured LLM-based justifications.
We evaluate Delphi on the MIMIC-III sepsis cohort using both standard off-policy evaluations (WIS↑1.47, DR↑1.29, RMSE↓0.207) and the first blinded physician evaluation of an explainable RL system in healthcare. Delphi consistently outperforms historical physicians' treatments in safety (+10.4\%), understandability (+8.9\%), and adoption rate (+5.75\%) across six clinical axes. These results highlight Delphi’s potential as a safe, interpretable, and patient-specific AI assistant for critical care.

Delphi: A Neuro-Symbolic Framework for Individualized, Safe and Interpretable Treatment Recommendation

Zero-Shot Anomaly Detection (ZSAD) aims to identify and localize anomalous regions in images of unseen object classes. While recent methods based on vision-language models like CLIP show promise, their performance is constrained by existing prompt engineering strategies. Current approaches, whether relying on single fixed, learnable, or dense dynamic prompts, suffer from a representational bottleneck and are prone to overfitting on auxiliary data, failing to generalize to the complexity and diversity of unseen anomalies. To overcome these limitations, we propose $\mathtt{PromptMoE}$. Our core insight is that robust ZSAD requires a compositional approach to prompt learning. Instead of learning monolithic prompts, $\mathtt{PromptMoE}$ learns a pool of expert prompts, which serve as a basis set of composable semantic primitives, and a visually-guided Mixture-of-Experts (MoE) mechanism to dynamically combine them for each instance.
Our framework materializes this concept through a Visually-Guided Mixture of Prompt (VGMoP) that employs an image-gated sparse MoE to aggregate diverse normal and abnormal expert state prompts, generating semantically rich textual representations with strong generalization. Extensive experiments across 15 datasets in industrial and medical domains demonstrate the effectiveness and state-of-the-art performance of $\mathtt{PromptMoE}$.

PromptMoE: Generalizable Zero-Shot Anomaly Detection via Visually-Guided Prompt Mixtures

Fine-tuning large language models is essential for task-specific adaptation, yet it remains computationally prohibitive. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a solution, but current approaches typically ignore the distinct roles of model components and the heterogeneous importance across layers, thereby limiting adaptation efficiency.
Motivated by the observation that Rotary Position Embeddings (RoPE) induce critical activations in the low-frequency dimensions of attention states, we propose RoPE-aware Selective Adaptation (RoSA), a novel PEFT framework that allocates trainable parameters in a more targeted and effective manner.
RoSA comprises a RoPE-aware Attention Enhancement (RoAE) module, which selectively enhances the low-frequency components of RoPE-influenced attention states, and a Dynamic Layer Selection (DLS) strategy that adaptively identifies and updates the most critical layers based on LayerNorm gradient norms.
By combining dimension-wise enhancement with layer-wise adaptation, RoSA achieves more targeted and efficient fine-tuning.
Extensive experiments on fifteen commonsense and arithmetic benchmarks demonstrate that RoSA outperforms existing mainstream PEFT methods under comparable trainable parameters. The code is available to ease reproducibility.

RoSA: Enhancing Parameter-Efficient Fine-Tuning via RoPE-aware Selective Adaptation in Large Language Models

Scanpath prediction in omnidirectional images (ODIs) serves as a critical component for optimizing foveated rendering efficiency and enhancing interactive quality in virtual reality systems. However, existing scanpath prediction methods for ODIs still suffer from fundamental limitations: (1) inadequate modeling and capturing of long-range temporal dependencies in fixation regions, and (2) suboptimal integration of spatial and temporal visual features, ultimately compromising prediction performance. To address these limitations, we propose a novel Dual-Temporal Modulated Diffusion model for Omnidirectional Images Scanpath Prediction, named SalDiff-DTM model, to effectively generate realistic human eye viewing trajectories. Specifically, to effectively model spatial relationships, we propose a novel Dual-Graph Convolutional Network (Dual-GCN) module that simultaneously captures semantic-level and image-level correlations. By integrating both local spatial details and global contextual information across the internal temporal dimension, this module achieves comprehensive and robust modeling of spatial relationships. To further enhance the modeling of temporal dependencies inherent in diverse fixation patterns, we introduce TABiMamba (Temporal-Aware BiLSTM-Mamba), a dedicated module that synergistically combines the contextual sensitivity of BiLSTM with the long-range sequence modeling capabilities of Mamba. This design facilitates deep information flow and context-aware sequential reasoning, thereby enabling high-fidelity capture of intricate temporal correlations. Inspired by the progressive refinement mechanism of diffusion models in various generative tasks, we propose a saliency-guided diffusion module that formulates the prediction problem as a conditional generative process, iteratively yielding accurate and perceptually plausible scanpaths. Extensive experiments demonstrate that SalDiff-DTM significantly outperforms state-of-the-art models, paving the way for future advancements in eye-tracking technologies and cognitive modeling, while broadening the horizons for immersive VR development.

SalDiff-DTM: A Novel Dual-Temporal Modulated Diffusion Model for Omnidirectional Images Scanpath Prediction

Recent unified models have demonstrated that the reasoning capacity of Multimodal Large Language Models (MLLMs) can be leveraged to facilitate diffusion-based image generation with impressive flexibility and performance. However, approaches that rely heavily on MLLMs for high-level semantic encoding often struggle with fine-grained visual tasks like image editing and virtual try-on. To address this gap, we propose {FUSE}, a unified framework excelling at both high-level vision–language understanding and fine-grained generation. First, we introduce a Semantic-to-Detail Connector that pre-aligns fine-grained visual features with the MLLM's semantic space. This design counteracts the low-level information loss inherent in MLLM encodings, creating a unified representation that steers the diffusion process with both global semantics and rich local details. Second, to further enhance semantic awareness and detail preservation, we introduce Adaptive-GRPO, a post-training objective that dynamically balances semantic coherence against pixel-level fidelity. The integration of these two innovations allows FUSE to generate images that are both semantically faithful and visually fine-grained. Comprehensive experiments on text-to-image and instruction-guided editing benchmarks show that FUSE significantly outperforms existing unified baselines, achieving 0.89 on Geneval, 0.65 on WISE, and 3.88 on ImageEdit.

FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation

Mechanical ventilation is essential in intensive care units (ICUs), but prolonged use increases patient risk. Reinforcement learning (RL) offers potential for optimizing ventilator management, yet its clinical adoption is limited by the lack of interpretable and realistic simulation environments. We propose an interpretable and probabilistic patient environment simulator based on action-based k-nearest neighbors and empirical transition probabilities, modeling stochastic state transitions grounded in real ICU data (MIMIC-IV and eICU). The simulator supports anomaly detection and provides probabilistic next-state distributions to enhance transparency and safety. Within this environment, we benchmark seven offline RL algorithms under clinically guided reward designs, including five distinct reward function configurations to explore the impact of reward shaping on agent behavior. Our results show that RL agents such as Double DQN and NFQ outperform empirical physician policies in meeting extubation guidelines, especially for high-severity patients. This benchmark enables standardized, interpretable evaluation of RL-based decision support tools for critical care.

Benchmarking Reinforcement Learning Algorithms for ICU Ventilator Settings: An Interpretable and Probabilistic Patient Environment for Doctor Agents

The Dual Diffusion Implicit Bridge (DDIB) is an emerging image-to-image (I2I) translation method that preserves cycle consistency while achieving strong flexibility. It links two independently trained diffusion models (DMs) in the source and target domains by first adding noise to a source image to obtain a latent code, then denoising it in the target domain to generate the translated image. However, this method faces two key challenges: (1) low translation efficiency, and (2) translation trajectory deviations caused by mismatched latent distributions. To address these issues, we propose a novel I2I translation framework, OT-ALD, grounded in optimal transport (OT) theory, which retains the strengths of DDIB-based approach. Specifically, we compute an OT map from the latent distribution of the source domain to that of the target domain, and use the mapped distribution as the starting point for the reverse diffusion process in the target domain. Our error analysis confirms that OT-ALD eliminates latent distribution mismatches. Moreover, OT-ALD effectively balances faster image translation with improved image quality. Experiments on four translation tasks across three high-resolution datasets show that OT-ALD improves sampling efficiency by 20.29\% and reduces the FID score by 2.6 on average compared to the top-performing baseline models.

OT-ALD: Aligning Latent Distributions with Optimal Transport for Accelerated Image-to-Image Translation

Text-to-Image (T2I) models typically deploy safety mechanisms to prevent the generation of sensitive images. 
Unfortunately, recent jailbreaking attack methods manually design prompts for the LLM to generate adversarial prompts, which effectively exposing safety vulnerabilities of T2I models. 
However, existing methods have two limitations: 1) relying on manually exhaustive strategies for designing adversarial prompts, lacking a unified framework, and 2) requiring numerous queries to achieve a successful attack, limiting their practical applicability. 
To address this issue, we propose Reason2Attack~(R2A), which aims to enhance the effectiveness and efficiency of the LLM in jailbreaking attacks.
Specifically, we first use Frame Semantics theory to systematize existing manually crafted strategies and propose a unified generation framework to generate CoT adversarial prompts step by step. 
Following this, we propose a two-stage LLM reasoning training framework guided by the attack process.
In the first stage, the LLM is fine-tuned with CoT examples generated by the unified generation framework to internalize the adversarial prompt generation process grounded in Frame Semantics. In the second stage, we incorporate the jailbreaking task into the LLM's reinforcement learning process, guided by the proposed attack process reward function that balances prompt stealthiness, effectiveness, and length, enabling the LLM to understand T2I models and safety mechanisms.
Extensive experiments on various T2I models with safety mechanisms, and commercial T2I models, show the superiority and practicality of R2A.
\textbf{Note: This paper includes model-generated content that may contain offensive or distressing material.}

Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning

Prompt tuning has emerged as a lightweight strategy for adapting foundation models to downstream tasks, particularly for resource-constrained systems. As pre-trained prompts become valuable assets, combining multiple source prompts offers a promising approach to enhance generalization for new tasks by leveraging complementary knowledge. However, naive aggregation often overlooks different source prompts have different contribution potential to the target task.
To address this, we propose HGPrompt, a dynamic framework that learns optimal ensemble weights. These weights are optimized by jointly maximizing an information-theoretic metric for transferability and minimizing gradient conflicts via a novel regularization strategy. Specifically, we propose a differentiable prompt transferability metric to captures the discriminability of prompt-induced features on the target task. Meanwhile, HGPrompt match the gradient variances with respect to different source prompts based on Hessian and Fisher Information, ensuring stable and coherent knowledge transfer while suppressing gradient conflicts among them.
Extensive experiments on the large-scale VTAB benchmark demonstrate the state-of-the-art performance of HGPrompt, validating its effectiveness in learning an optimal ensemble for effective multi-source prompt transfer.

Content not yet available

Next from AAAI 2026

TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES