Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chains of thought (CoT), hampering interaction efficiency in real-world scenarios. Moreover, the field still lacks a systematic definition of LLM-agent efficiency, which hinders targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes the tokens used per step, and (ii) trajectory-level efficiency, which minimizes the number of steps needed to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference-based optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9\% and steps by up to 26.9\%, while improving task performance by up to 29.3\%. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25\% of the data. The code is available in the Appendix.
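To make the dual-efficiency objective concrete, here is a minimal Python sketch of how a preference pair might be selected under it. This is an illustrative assumption, not the authors' implementation: given two sampled trajectories for the same task, prefer the one that succeeds, break ties by fewer steps (trajectory-level efficiency), then by fewer tokens per step (step-level efficiency). All names (`Trajectory`, `prefer`) are hypothetical.

```python
# Hypothetical sketch of dual-efficiency preference selection (not the
# paper's code): rank two trajectories by success, then step count,
# then tokens per step, to form a chosen/rejected pair.
from dataclasses import dataclass


@dataclass
class Trajectory:
    success: bool       # did the agent complete the task?
    num_steps: int      # trajectory-level cost: environment steps taken
    total_tokens: int   # step-level cost: tokens generated across all steps

    @property
    def tokens_per_step(self) -> float:
        return self.total_tokens / max(self.num_steps, 1)


def prefer(a: Trajectory, b: Trajectory) -> Trajectory:
    """Return the preferred ("chosen") trajectory of a preference pair."""
    if a.success != b.success:          # correctness dominates efficiency
        return a if a.success else b
    if a.num_steps != b.num_steps:      # fewer steps to finish the task
        return a if a.num_steps < b.num_steps else b
    # more succinct reasoning at each action step
    return a if a.tokens_per_step <= b.tokens_per_step else b


# Example: both trajectories succeed, but the second uses fewer steps
# and fewer tokens, so it becomes the "chosen" side of the pair.
chosen = prefer(Trajectory(True, 13, 5200), Trajectory(True, 9, 2100))
print(chosen.num_steps)  # 9
```

The resulting chosen/rejected pairs could then feed any standard preference-optimization objective; the ordering above simply encodes that efficiency is rewarded only among trajectories that actually complete the task.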
