Singapore

To date, distributional reinforcement learning (distributional RL) methods have exclusively focused on the discounted setting, where an agent aims to optimize a potentially-discounted sum of rewards over time. In this work, we extend distributional RL to the average-reward setting, where an agent aims to optimize the reward received per time-step. In particular, we utilize a quantile-based approach to develop the first set of algorithms that can successfully learn and/or optimize the long-run per-step reward distribution, as well as the differential return distribution of an average-reward MDP. We derive proven-convergent tabular algorithms for both prediction and control, as well as a broader family of algorithms that have appealing scaling properties. Empirically, we find that these algorithms yield competitive and sometimes superior performance when compared to their non-distributional equivalents, while also capturing rich information about the long-run per-step reward and differential return distributions.
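The abstract does not spell out the algorithms, so the following is only a minimal sketch of the generic stochastic quantile-regression update that quantile-based methods build on, applied here to a stream of per-step rewards. The function names, the midpoint quantile levels, the step size, and the uniform reward stream are all illustrative assumptions, not the paper's method.

```python
import random

def quantile_levels(m):
    """Midpoint quantile levels tau_i = (2i + 1) / (2m) for i = 0..m-1."""
    return [(2 * i + 1) / (2 * m) for i in range(m)]

def update_quantiles(theta, taus, reward, step_size):
    """One stochastic quantile-regression step toward the observed reward.

    Each estimate moves up by step_size * tau when the reward is at or above
    it, and down by step_size * (1 - tau) when the reward is below it.
    """
    for i, tau in enumerate(taus):
        indicator = 1.0 if reward < theta[i] else 0.0
        theta[i] += step_size * (tau - indicator)
    return theta

def estimate_reward_quantiles(rewards, m=5, step_size=0.005):
    """Run the update over a reward stream, starting from all-zero estimates."""
    taus = quantile_levels(m)
    theta = [0.0] * m
    for r in rewards:
        update_quantiles(theta, taus, r, step_size)
    return taus, theta

# Hypothetical reward stream: i.i.d. uniform rewards on [0, 1) with a fixed seed.
rng = random.Random(0)
rewards = [rng.random() for _ in range(20000)]
taus, theta = estimate_reward_quantiles(rewards)

# Averaging the quantile estimates gives a rough estimate of the mean reward.
average_reward = sum(theta) / len(theta)
```

For a uniform reward on [0, 1), the quantile at level tau is tau itself, so each entry of `theta` should settle near the matching entry of `taus`, and `average_reward` should settle near 0.5.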

AAAI 2026

A Differential Perspective on Distributional Reinforcement Learning

distributional reinforcement learning

machine learning

reinforcement learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.


Human-defined creativity is highly abstract, making it difficult for multimodal large language models (MLLMs) to comprehend and assess creativity in a way that aligns with human judgments. The absence of an existing benchmark further exacerbates this problem. To this end, we propose CreBench, which consists of two key components: 1) an evaluation benchmark covering multiple dimensions from creative idea to process to product; 2) CreMIT (Creativity Multimodal Instruction Tuning dataset), a multimodal creativity evaluation dataset consisting of 2.2K diverse-sourced multimodal data, 79.2K pieces of human feedback, and 4.7M multi-typed instructions. Specifically, to ensure MLLMs can handle diverse creativity-related queries, we prompt GPT to refine the human feedback to activate stronger creativity assessment capabilities. CreBench serves as a foundation for building MLLMs that understand human-aligned creativity. Based on CreBench, we fine-tune open-source general MLLMs, resulting in CreExpert, a multimodal creativity evaluation expert model. Extensive experiments demonstrate that the proposed CreExpert models achieve significantly better alignment with human creativity evaluation than state-of-the-art MLLMs, including the most advanced GPT-4V and Gemini-Pro-Vision. To our knowledge, we are the first to propose a benchmark for creativity evaluation.

CreBench: Human-Aligned Creativity Evaluation from Idea to Process to Product

Trip recommendation aims to generate a sequence of points of interest (POIs) in response to a user's query input. Existing data-driven methods mainly fall into two categories: supervised approaches and self-supervised approaches. The former cannot fully capture the transition patterns among POIs, while the latter fail to comprehensively model users' query intents.
Fortunately, privileged knowledge distillation (PKD) provides a unique opportunity to align a user's query intents with the corresponding trip in historical data. However, such knowledge alignment is implicit and may not directly reflect the query intents. To this end, we propose EKD-Trip, an explicit intent-enhanced knowledge distillation framework. EKD-Trip first trains a trajectory encoder (teacher model) and a trip generator jointly in a self-supervised manner. Then, a query encoder (student model) is trained via multi-task learning to extract implicit knowledge from the teacher through PKD and explicit knowledge from an auxiliary task. At inference time, we use the query encoder and the trip generator to recommend trips. Extensive experiments on four real-world datasets demonstrate that EKD-Trip outperforms all baselines on three metrics, with a particularly notable improvement of 13.70% in pairs-F1.

Explicit Intent-Enhanced Knowledge Distillation for Trip Recommendation

Numerical reasoning over documents, which demands both contextual understanding and logical inference, is challenging for low-capacity local models deployed on computation-constrained devices. Although such complex reasoning queries could be routed to powerful remote models like GPT-4, exposing local data raises significant data leakage concerns. Existing mitigation methods generate problem descriptions or examples for remote assistance. However, the inherent complexity of numerical reasoning hinders the local model from generating logically equivalent queries and accurately inferring answers with remote guidance. In this paper, we present a model collaboration framework with two key innovations: (1) a context-aware synthesis strategy that shifts the query topics while preserving reasoning patterns; and (2) a tool-based answer reconstruction approach that reuses the remote-generated plug-and-play solution with code snippets. Experimental results demonstrate that our method achieves better reasoning accuracy than solely using local models while providing stronger data protection than fully relying on remote models. Furthermore, our method improves accuracy by 16.2% to 43.6% while reducing data leakage by 2.3% to 44.6% compared to existing data protection approaches.

Collaborative LLM Numerical Reasoning with Local Data Protection

Personalized image generation is crucial for improving the user experience, as it renders reference images into preferred ones according to user visual preferences. Although effective, existing methods face two main issues. First, they treat all items in the user's historical sequence equally when extracting user preferences, overlooking the varying semantic similarities between historical items and the reference item. Disproportionately high weights for low-similarity items distort user visual preferences for the reference item. Second, they rely heavily on consistency between generated and reference images to optimize generation, which leads to underfitting user preferences and hinders personalization. To address these issues, we propose Retrieval Augmented Personalized Image GenerAtion guided by Recommendation (RAGAR). Our approach uses a retrieval mechanism to assign different weights to historical items according to their similarities to the reference item, thereby extracting more refined user visual preferences for the reference item. We then introduce a novel ranking task based on a multi-modal ranking model to optimize the personalization of the generated images instead of forcing dependence on consistency. Extensive experiments and human evaluations on three real-world datasets demonstrate that RAGAR achieves significant improvements in both personalization and semantic metrics compared to five baselines.
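The idea of weighting historical items by their similarity to the reference item can be sketched as follows. This is a hedged illustration only: cosine similarity, the softmax with a temperature, the two-dimensional embeddings, and all function names are assumptions made for the example, and the abstract does not state that RAGAR uses these particular choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two nonzero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_preference(history, reference, temperature=0.1):
    """Pool historical item embeddings, weighted by similarity to the reference."""
    weights = softmax([cosine(h, reference) for h in history], temperature)
    dim = len(reference)
    pooled = [sum(w * h[d] for w, h in zip(weights, history)) for d in range(dim)]
    return weights, pooled

# Hypothetical embeddings: items 0 and 2 resemble the reference, item 1 does not.
history = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
reference = [1.0, 0.0]
weights, pooled = weighted_preference(history, reference)
```

With these inputs the dissimilar item receives a weight close to zero, so the pooled preference vector is dominated by the two items that resemble the reference, in contrast to an unweighted average that would count all three equally.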

RAGAR: Retrieval Augmented Personalized Image Generation Guided by Recommendation

Traditional text-to-motion frameworks often lack precise control, and existing approaches based on joint keyframe locations provide only positional guidance, making it challenging and unintuitive to specify body part orientations and motion timing. To address these limitations, we introduce the Salient Orientation Symbolic (SOS) script, a programmable symbolic framework for specifying body part orientations and motion timing at keyframes.
We further propose an automatic SOS extraction pipeline that employs temporally-constrained agglomerative clustering for frame saliency detection and a Saliency-based Masking Scheme (SMS) to generate sparse, interpretable SOS scripts directly from motion data. Moreover, we present the SOSControl framework, which treats the available orientation symbols in the sparse SOS script as salient and prioritizes satisfying these constraints during motion generation. By incorporating SMS-based data augmentation and gradient-based iterative optimization, the framework enhances alignment with user-specified constraints. Additionally, it employs a ControlNet-based ACTOR-PAE Decoder to ensure smooth and natural motion outputs.
Extensive experiments demonstrate that the SOS extraction pipeline generates human-interpretable scripts with symbolic annotations at salient keyframes, while the SOSControl framework outperforms existing baselines in motion quality, controllability, and generalizability with respect to motion timing and body part orientation control.

SOSControl: Enhancing Human Motion Generation Through Saliency-Aware Symbolic Orientation and Timing Control

Despite the rich spatiotemporal patterns contained in trajectory data from multiple Location-Based Social Network (LBSN) platforms, heterogeneous formats, semantic inconsistencies, and unequal user scales across platforms create substantial barriers to reliable identity mapping. Furthermore, GPS drift and sparse sampling result in degraded data quality and distribution imbalance, which render existing trajectory representation methods inadequate for capturing high-order dependencies and dynamic spatiotemporal evolution patterns in heterogeneous multi-relational graphs.
To this end, we propose HANCUA (Hierarchical Attention Network with Correction for User Association), a novel framework that employs a dual-stage correction mechanism to enhance cross-domain trajectory analysis. The approach constructs hierarchical multi-relational graphs comprising location, trajectory, and correction layers to capture fine-grained mobility patterns, behavioral associations, and inter-platform distribution differences. We design relation-aware multi-head graph attention networks to model complex interactions among heterogeneous node types, which enables comprehensive spatial relationship modeling. A spatiotemporal semantic collaborative learning module integrates temporal information with mobility patterns through interaction-aware attention mechanisms, while an ensemble correction decision module incorporates ensemble learning principles to systematically correct user association biases and address distribution imbalance problems.
Extensive experiments on two real-world LBSN cross-domain datasets reveal that HANCUA significantly outperforms state-of-the-art methods in user identity linking accuracy.

Hierarchical Attention Network with Correction for Cross-Domain User Association

As a knowledge-intensive and challenging task, automatic generation of long-form wiki-style articles has garnered increasing attention from researchers due to its ability to efficiently integrate, organize and present vast amounts of both structured and unstructured knowledge. 
To the best of our knowledge, most existing state-of-the-art methods for automatic wiki-style article generation follow a "one-shot generation" paradigm: given a topic, they (1) first generate a structured outline and (2) then generate the content of each outline chapter independently and in parallel, in one shot, using the chapter title and references. However, the core limitation of this paradigm is that it disregards inter-chapter correlation and lacks post-generation revision and refinement, resulting in content redundancy, weak relevance, and logical inconsistency. To address these issues, we propose WikiREVIEW, a novel multi-perspective review framework for automatic wiki-style article generation. Specifically, our method introduces multi-perspective experts to review the content of each outline chapter at both the chapter and paragraph levels after the initial generation, offering evaluation feedback and continuously refining the deficiencies in the initial long-form article, ultimately achieving high-quality wiki-style article generation. Extensive experimental results on the public English dataset FreshWiki and our self-constructed high-quality Chinese dataset ChineseWiki demonstrate that WikiREVIEW significantly outperforms existing state-of-the-art automatic wiki-style article generation methods across all automatic evaluation metrics and in human evaluation.

WikiREVIEW: A Multi-Perspective Review Framework for Automatic Wiki-Style Article Generation

Audio synthesis has broad applications in multimedia. Recent advancements have made it possible to generate relevant audio from inputs describing an audio scene, such as images or texts. However, the immersiveness and expressiveness of the generation are limited. One possible problem is that existing methods rely solely on the global scene and overlook details of local sounding objects (i.e., sound sources). To address this issue, we propose a Sound Source-Aware Audio (SS2A) generator. SS2A is able to locally perceive multimodal sound sources from a scene with visual detection and cross-modality translation. It then contrastively learns a Cross-Modal Sound Source (CMSS) Manifold to semantically disambiguate each source. Finally, we attentively mix their CMSS semantics into a rich audio representation, from which a pretrained audio generator outputs the sound. To model the CMSS manifold, we curate a novel single-sound-source visual-audio dataset, VGGS3, from VGGSound. We also design a Sound Source Matching Score to clearly measure localized audio relevance. With the effectiveness of explicit sound source modeling, SS2A achieves state-of-the-art performance in extensive image-to-audio tasks. We also qualitatively demonstrate SS2A's ability to achieve intuitive synthesis control by compositing vision, text, and audio conditions. Furthermore, we show that our sound source modeling can achieve competitive video-to-audio performance with a straightforward temporal aggregation mechanism.

Gotta Hear Them All: Towards Sound Source Aware Audio Generation

Memory behavior modeling seeks to predict individual recall performance and understand its underlying cognitive mechanisms. However, the dynamic and heterogeneous nature of memory data poses significant challenges to the generalization ability of models under unseen conditions. To address this challenge, we propose an invariant representation learning framework I-Mem that integrates self-supervised contrastive learning with decorrelation constraints, enabling the adaptive identification and suppression of environment-related factors in sequential behavioral data, thereby mitigating the influence of spurious features and enhancing the modeling of stable cognitive structures. Importantly, the method does not rely on explicit environment partitioning or predefined environment labels, while our theoretical analysis demonstrates that it can effectively resist environmental perturbations and facilitate the extraction of invariant structural representations, thereby ensuring adaptability and generalization. Empirical evaluations on both synthetic and real-world datasets further confirm its superiority over mainstream methods in terms of generalization performance and stable feature identification. Feature attribution analysis reveals that I-Mem extracts invariant features aligned with classical cognitive effects, and reflects short-term behavioral patterns that may indicate latent cognitive mechanisms beyond existing theories, highlighting both interpretability and discovery potential.

Invariant Representation Learning for Memory Behavior Modeling via Adaptive Environment Separation

Neural Radiance Fields (NeRF)-based Visual Simultaneous Localization and Mapping (SLAM) achieves superior scene geometric modeling and robust camera tracking by leveraging neural representations.
Existing methods typically rely on multi-resolution hash encoding with truncated signed distance fields (TSDF) to achieve high frame rates. However, unavoidable hash collisions can lead to artifacts, and multi-view color inconsistencies in indoor scenes can result in shape-radiance ambiguity, adversely affecting geometric quality and tracking accuracy.
To address these issues, we propose a novel Multi-scale Hybrid Encoding-based Decoupled SLAM (MHED-SLAM). 
First, to mitigate the adverse effects of hash collisions and reduce the number of learnable parameters, we innovatively fuse a coarse-scale hash tri-plane with a fine-scale hash grid within a single latent volume. 
Second, to enable precise geometric reconstruction and camera tracking, we decouple the reconstruction and rendering processes, independently learning a TSDF field for reconstruction and a density field for rendering.
Third, we devise a Symmetric Kullback-Leibler (SKL) strategy based on ray termination distributions to align the probability distributions derived from the TSDF and density fields for their synchronous convergence. 
Extensive experimental evaluations demonstrate that our approach surpasses state-of-the-art (SOTA) methods, running at a faster frame rate of 20 Hz with fewer parameters while achieving higher tracking and reconstruction accuracy.
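As a rough illustration of a symmetric Kullback-Leibler objective between two per-ray termination distributions, the sketch below computes KL(p || q) + KL(q || p) for two discrete distributions over samples along a ray. The unscaled sum (some definitions halve it), the epsilon smoothing, the function names, and the example weights are all assumptions; the abstract does not give the exact form used in MHED-SLAM.

```python
import math

def normalize(weights):
    """Scale nonnegative weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL(p || q); eps guards against log(0) when q has zero entries."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def symmetric_kl(p, q):
    """Symmetric KL as the sum of both directions (an assumed convention)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

# Hypothetical ray termination weights at five samples along one ray.
tsdf_weights = normalize([0.05, 0.15, 0.60, 0.15, 0.05])
density_weights = normalize([0.10, 0.20, 0.40, 0.20, 0.10])
loss = symmetric_kl(tsdf_weights, density_weights)
```

The value is zero only when the two distributions coincide and is unchanged when the arguments are swapped, which is what makes it a natural choice for pulling two fields toward each other rather than treating one as the fixed target.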
