Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has proven effective for complex decision-making tasks by providing "temporal abstraction", which decomposes tasks into smaller, more manageable "subgoals" and thereby enables agents to plan over longer time scales. However, achieving optimal exploration and exploitation remains a challenge, especially in long-horizon or sparse-reward scenarios. In this paper, we introduce Active exploration and hierarchical Self-Imitation (ASI), an effective scheme that enhances exploration and exploitation based on subgoal representation learning. The key idea of ASI is to exploit temporal adjacency information in the representation space. We construct and dynamically update an adjacency graph that captures the relationships between subgoals. Based on the adjacency information provided by the graph, we design two mechanisms: (1) active "frontier-reaching" exploration, which expands the explored area faster by targeting boundary regions, and (2) hierarchical self-imitation learning, which leverages historical experience to facilitate both frontier reaching and policy training. Experimental results show that our method accelerates exploration and outperforms existing baselines on challenging long-horizon continuous control tasks.
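To make the adjacency-graph idea concrete, below is a minimal sketch of how such a graph over subgoal representations might be maintained and queried for frontier subgoals. All names and details here (AdjacencyGraph, the discretization grid, the degree-based frontier criterion, the count-based sampling weights) are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Illustrative sketch only: maintains a graph over discretized subgoal
# representations, linking subgoals that occur close together in time,
# and samples sparsely connected, rarely visited "frontier" subgoals.
import numpy as np
from collections import defaultdict

class AdjacencyGraph:
    """Graph over discretized subgoal representations; two subgoals are
    connected if they were visited within `k_adj` environment steps."""
    def __init__(self, k_adj=10, cell_size=0.5):
        self.k_adj = k_adj
        self.cell_size = cell_size      # discretization grid (an assumption)
        self.edges = defaultdict(set)   # node -> set of adjacent nodes
        self.visits = defaultdict(int)  # node -> visitation count

    def _node(self, phi_s):
        # Discretize a continuous representation phi(s) into a hashable cell.
        return tuple(np.floor(np.asarray(phi_s) / self.cell_size).astype(int))

    def update(self, phi_trajectory):
        """Dynamically update the graph from one trajectory of subgoal
        representations phi(s_0), ..., phi(s_T)."""
        nodes = [self._node(p) for p in phi_trajectory]
        for t, u in enumerate(nodes):
            self.visits[u] += 1
            # Link subgoals that are temporally adjacent (within k_adj steps).
            for v in nodes[t + 1 : t + 1 + self.k_adj]:
                if v != u:
                    self.edges[u].add(v)
                    self.edges[v].add(u)

    def sample_frontier(self, rng, max_degree=3):
        """Pick a frontier subgoal: sparsely connected and rarely visited,
        i.e. likely on the boundary of the explored region."""
        frontier = [u for u in self.visits if len(self.edges[u]) <= max_degree]
        if not frontier:
            frontier = list(self.visits)
        # Prefer low visitation counts (a count-based novelty assumption).
        weights = np.array([1.0 / self.visits[u] for u in frontier])
        return frontier[rng.choice(len(frontier), p=weights / weights.sum())]
```

In this sketch, the high-level policy would call update() after each trajectory and sample_frontier() when selecting an exploratory subgoal; the stored trajectories that successfully reach frontiers could then serve as targets for the self-imitation objective.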
