Large language models have enabled sophisticated dialogue planning policies, but their reliance on LLM-generated simulation and feedback for policy optimization may introduce systematic preference bias. We present the first comprehensive analysis of preference bias in LLM-based dialogue planners, evaluating four state-of-the-art planning policies across three dialogue domains using multiple LLM families at varying scales. Our investigation reveals that all tested planners exhibit significant preference bias, systematically favoring narrow strategy sets rather than maintaining balanced distributions. User simulation emerges as the primary driver of this bias, and diverse persona simulation fails as an effective mitigation strategy. Most concerningly, preference bias drives planners toward ethically problematic strategies that achieve short-term success while undermining real-world effectiveness and ethical standards. Our findings establish fundamental challenges for the responsible deployment of LLM-based dialogue systems and provide crucial insights for developing more reliable and ethically aligned planning approaches.
