Singapore

Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) have demonstrated substantial success on complex reasoning tasks, leveraging scaled inference computation. However, the sparse and one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts and subsequently learns from the rewritten reasoning to improve the quality of its internal thought process. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving the original GRPO loss in full. For practical implementation, we compile rewriting and vanilla generation within a single batch, maintaining the scalability of the RL algorithm and introducing only about 10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions to truncate reasoning, outperforming existing strong baselines. In terms of internal quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric. All relevant code and data will be released.
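To make the selective-rewriting rule concrete, here is a minimal sketch; it is not the authors' implementation. It assumes binary outcome rewards, interprets "consistent correctness" as every sampled rollout for a query being correct, and uses the commonly used group-normalized GRPO advantage. The names `grpo_advantages` and `is_simple` and the toy reward groups are hypothetical. Under these assumptions, every rollout of a consistently correct query has zero advantage, so picking only such queries for rewriting leaves the learning signal from the other queries untouched.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_simple(rewards):
    """Assumed reading of 'consistent correctness': all rollouts are correct."""
    return all(r == 1.0 for r in rewards)


# Hypothetical outcome rewards for two queries, four rollouts each.
groups = {
    "q1": [1.0, 1.0, 1.0, 1.0],  # always correct: candidate for rewriting
    "q2": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: keeps its usual GRPO signal
}

to_rewrite = [query for query, rewards in groups.items() if is_simple(rewards)]

for query, rewards in groups.items():
    print(query, grpo_advantages(rewards))
print("queries selected for rewriting:", to_rewrite)
```

In this toy setting only `q1` is selected, and its advantages are all zero, which is why rewriting it does not displace any correctness-driven gradient.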

AAAI 2026

Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

unsupervised & self-supervised learning

large language models

reinforcement learning


poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Large Language Models (LLMs) surprised the world with their ability to mimic humans in writing and are starting to be used as simulations of human writers for various kinds of linguistic analysis. However, these analyses rest on the belief that LLMs are good density models that accurately capture the underlying probability distribution of the language. In this paper, we question this basic assumption and evaluate language models on their density modelling capabilities. Since no ground truth exists for the probability distribution of any natural language, we construct a synthetic language made up of decimal numbers written out in English words. We train language models from scratch on various probability distributions over this synthetic language and compare the distributions learned by the models with the original ones. Experiments show that language models can learn underlying probability distributions across a wide range of cases, but they fail when those distributions depend on deep semantic properties of numbers that cannot be inferred from syntactic patterns. Additionally, we observed a strong bias in the models toward numbers that frequently occur as substrings within other numbers. In natural language models, this bias can impact downstream tasks that rely on model-generated probabilities.
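The substring effect mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's setup: it assumes a synthetic language restricted to the integers 0 to 99 written in English words, and it counts word-level (not character-level) containment. The helpers `to_words` and `substring_counts` are hypothetical names introduced only for illustration.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def to_words(n: int) -> str:
    """Spell an integer in [0, 99] in English words."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def substring_counts(limit: int = 100) -> dict:
    """For each number, count the other numbers whose spelling contains it
    as a contiguous run of whole words."""
    words = [to_words(n) for n in range(limit)]
    counts = {}
    for w in words:
        counts[w] = sum(
            1 for other in words
            if other != w and f" {w} " in f" {other} "
        )
    return counts


counts = substring_counts()
# Numbers such as "twenty" or "one" appear inside many other numbers,
# while "eleven" appears inside none.
print(counts["twenty"], counts["one"], counts["eleven"])
```

In this restricted vocabulary, "twenty" is contained in nine other numbers and "one" in eight, which gives a sense of how unevenly such substrings are distributed even in a tiny synthetic language.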

Are Language Models Any Good at Density Modeling?

Reinforcement Learning (RL) has shown significant promise in automated portfolio management; however, effectively balancing risk and return remains a central challenge, as many models fail to adapt to dynamically changing market conditions. In this paper, we propose MARS, a novel hierarchical deep RL framework designed to explicitly address this limitation through a multi-agent, risk-aware approach. Instead of a single monolithic model, MARS employs a Heterogeneous Agent Ensemble where each agent possesses a unique, intrinsic risk profile. This profile is enforced by a dedicated Safety-Critic network and a specific risk-tolerance threshold, allowing agents to specialize in behaviors ranging from capital preservation to aggressive growth. To navigate different market regimes, a high-level Meta-Adaptive Controller (MAC) learns to dynamically orchestrate the ensemble. By adjusting its reliance on conservative versus aggressive agents, the MAC effectively lowers portfolio volatility during downturns and seeks higher returns in bull markets, thus minimizing maximum drawdown and enhancing overall stability. This two-tiered structure allows MARS to generate a disciplined and adaptive portfolio that is robust to market fluctuations. The framework achieves a superior balance between risk and return by leveraging behavioral diversity rather than explicit market-feature engineering. Experiments on major international stock indexes, including periods of significant market volatility and downturns, demonstrate the efficacy of our framework on risk-adjusted criteria, significantly reducing maximum drawdown and volatility while maintaining competitive returns.

MARS: A Meta-Adaptive Reinforcement Learning Framework for Risk-Aware Multi-Agent Portfolio Management

Large Language Models excel at semantic reasoning through high-dimensional geometry but systematically struggle with numerical reasoning due to the destruction of geometric continuity during tokenization. Traditional tokenization methods fragment numerical values into arbitrary tokens, undermining their inherent geometric and topological relationships. We introduce GeoNum, a geometrically coherent numerical embedding that addresses this fundamental challenge through polar coordinate representation. GeoNum employs polar decomposition to naturally decouple discrete ordinality for classification from continuous periodicity for regression, enabling unified discrete-continuous learning that preserves numerical cognition's dual nature. Through three-stage progressive training, GeoNum first learns continuous numerical representations via self-supervised reconstruction, then aligns these embeddings with textual representations through projection learning, and finally integrates into pre-trained LLMs via parameter-efficient fine-tuning. Empirical evaluations demonstrate GeoNum consistently surpasses baseline and state-of-the-art numerical encoding methods across multiple datasets, achieving substantial performance gains particularly in high-precision arithmetic tasks (e.g., ACC@0.1 improvements up to 48.6%). GeoNum transforms numerical processing from fragmented tokenization to coherent geometric representation, enabling principled numerical understanding in language models.

GeoNum: Bridging Numerical Continuity and Language Semantics via Geometric Embedding

Recent advancements in Audio-Video Large Language Models (AV-LLMs) have enhanced their capabilities in tasks like audio-visual question answering and multimodal dialogue systems.
Video and audio introduce an extended temporal dimension, resulting in a larger key-value (KV) cache compared to static image embeddings.
A naive optimization strategy is to selectively retain the KV caches of audio or video depending on the task. However, our experiments show that the attention AV-LLMs pay to each modality in the higher layers is not strictly task-dependent: in deeper layers, attention shifts toward the video modality. We also found that directly integrating the temporal KV of audio with the spatial-temporal KV of video may cause information confusion and significant performance degradation. If audio and video are processed indiscriminately, one modality may be excessively compressed or retained, disrupting the alignment between modalities.
To address these challenges, we propose AccKV, an Adaptive-Focusing and Cross-Calibration KV cache optimization framework designed specifically for efficient AV-LLM inference. Our method uses layer-adaptive focusing, which selectively focuses on key modalities according to the characteristics of different layers and enhances the recognition of heavy-hitter tokens through attention redistribution. In addition, we propose a Cross-Calibration technique that first integrates inefficient KV caches within the audio and video modalities, and then aligns low-priority modalities with high-priority modalities to selectively evict the KV cache of low-priority modalities.
Experimental results show that AccKV can significantly improve the computational efficiency of AV-LLMs while maintaining accuracy.

AccKV: Towards Efficient Audio-Video LLMs Inference via Adaptive-Focusing and Cross-Calibration KV Cache Optimization

Decision-focused learning (DFL) has emerged as a powerful end-to-end alternative to conventional predict-then-optimize (PTO) pipelines by directly optimizing predictive models through downstream decision losses. Existing DFL frameworks are limited by their strictly sequential structure, referred to as sequential DFL (S-DFL), which fails to capture the bidirectional feedback between prediction and optimization in complex interaction scenarios. In view of this, we propose, for the first time, recursive decision-focused learning (R-DFL), a novel framework that introduces bidirectional feedback between downstream optimization and upstream prediction. We further extend two distinct differentiation methods, explicit unrolling via automatic differentiation and implicit differentiation based on fixed-point methods, to facilitate efficient gradient propagation in R-DFL. We rigorously prove that both methods achieve comparable gradient accuracy, with the implicit method offering superior computational efficiency. Extensive experiments on both synthetic and real-world datasets, including the newsvendor problem and the bipartite matching problem, demonstrate that R-DFL not only substantially enhances the final decision quality over sequential baselines but also exhibits robust adaptability across diverse scenarios in closed-loop decision-making problems.
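As background for the distinction between a prediction loss and a decision loss, here is a small sketch using the textbook newsvendor cost. It is not the R-DFL method and does not involve any feedback loop; it assumes the order quantity is set equal to the point forecast, and the cost parameters (`underage=4.0`, `overage=1.0`) and demand values are hypothetical.

```python
def newsvendor_cost(order, demand, underage=4.0, overage=1.0):
    """Cost of ordering `order` units when realized demand is `demand`."""
    shortfall = max(demand - order, 0.0)  # unmet demand
    leftover = max(order - demand, 0.0)   # unsold stock
    return underage * shortfall + overage * leftover


def squared_error(prediction, demand):
    """Symmetric prediction loss used in a typical predict-then-optimize fit."""
    return (prediction - demand) ** 2


demand = 100.0
low_forecast, high_forecast = 90.0, 110.0

# Both forecasts are equally wrong under squared error ...
print(squared_error(low_forecast, demand), squared_error(high_forecast, demand))
# ... but they lead to different decision costs when the forecast is ordered.
print(newsvendor_cost(low_forecast, demand), newsvendor_cost(high_forecast, demand))
```

With these assumed costs, under-forecasting by 10 units costs 40 while over-forecasting by 10 units costs 10, even though the squared errors are identical. Training on the decision cost rather than the prediction error is the general idea behind decision-focused learning.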

From Sequential to Recursive: Enhancing Decision-Focused Learning with Bidirectional Feedback

Identifying in-vehicle electronic control units based on voltage characteristics has been the subject of extensive research in cybersecurity. However, the results reported so far generally depend on restricted datasets and supervised learning. In this work, we show that clustering, i.e., unsupervised learning, of voltage characteristics is in fact more challenging on a larger pool of electronic control units, as several out-of-the-box clustering methods and metrics fail to determine the correct number of clusters when applied to a large dataset. To overcome this issue, we propose a new methodology that takes advantage of domain-specific constraints, which guide the search toward the correct number of electronic control units in a car, or even in a larger pool of units from several cars. We introduce two new metrics: correctness, which measures the success ratio with respect to the constraints, and divergence, which measures the consistency of the clustering, and show that they provide a strong indication of the optimal number of clusters. In this specific context, both metrics prove to be more reliable than the widely used Silhouette score and the Davies-Bouldin and Calinski-Harabasz indexes. We successfully test our methodology on the largest dataset available today for in-vehicle voltage characteristics and discover new insights regarding the number of devices.

Constraint-Guided Clustering for Identifying in-Vehicle Electronic Control Units from Voltage Data

Accurate and efficient modeling of soft-tissue interactions is fundamental for advancing surgical simulation, surgical robotics, and model-based surgical automation. To achieve real-time latency, classical Finite Element Method (FEM) solvers are often replaced with neural approximations; however, naively training such models in a fully data-driven manner without incorporating physical priors frequently leads to poor generalization and physically implausible predictions. We present a novel physics-informed neural simulation framework that enables real-time prediction of soft-tissue deformations under complex single- and multi-grasper interactions. Our approach integrates Kelvinlet-based analytical priors with large-scale FEM data, capturing both linear and nonlinear tissue responses. This hybrid design improves predictive accuracy and physical plausibility across diverse neural architectures while maintaining the low-latency performance required for interactive applications. We validate our method on challenging surgical manipulation tasks involving standard laparoscopic grasping tools, demonstrating substantial improvements in deformation fidelity and temporal stability over existing baselines. These results establish Kelvinlet-augmented learning as a principled and computationally efficient paradigm for real-time, physics-aware soft-tissue simulation in surgical AI. Our code and data are available at: https://github.com/Anon92373/Neural-Kelvinlet.

Neural-Augmented Kelvinlet for Real-Time Soft Tissue Deformation Modeling

Map matching for sparse trajectories is a fundamental problem for many trajectory-based applications, e.g., traffic scheduling and traffic flow analysis. Existing methods for map matching are generally based on Hidden Markov Models (HMMs) or encoder-decoder frameworks. However, these methods continue to face significant challenges when handling noisy or sparsely sampled GPS trajectories. To address these limitations, we propose DiffMM, an encoder-diffusion-based map matching framework that produces effective yet efficient matching results through a one-step diffusion process. We first introduce a road segment-aware trajectory encoder that jointly embeds the input trajectory and its surrounding candidate road segments into a shared latent space through an attention mechanism. Next, we propose a one-step diffusion method that realizes map matching through a shortcut model, using the joint embedding of the trajectory and candidate road segments as conditioning context. We conduct extensive experiments on large-scale trajectory datasets, demonstrating that our approach consistently outperforms state-of-the-art map matching methods in terms of both accuracy and efficiency, particularly for sparse trajectories and complex road network topologies.

DiffMM: Efficient Method for Accurate Noisy and Sparse Trajectory Map Matching via One Step Diffusion

The fair division of indivisible goods is not only a subject of theoretical research, but also an important problem in practice, with solutions being offered on several online platforms. Little is known, however, about the characteristics of practical fair-division instances and how they compare to the characteristics of synthetic fair-division instances. Taking inspiration from the work of Szufa et al. (2020), we devise a map of fair-division instances. This map identifies two key axes along which fair-division instances differ, which help distinguish synthetic distributions, predict various features of the fair-division instances, and can be conceptually interpreted.

Putting Fair Division on the Map

Motivated by the increasing risks of data misuse and fabrication, we investigate the problem of identifying synthetic time series generated by Time-Series Large Models (TSLMs). While there is extensive research on detecting model-generated text, we find that these existing methods are not applicable to time series data due to the fundamental modality difference: time series usually have lower information density and smoother probability distributions than text data, which limits the discriminative power of token-based detectors. To address this issue, we examine the subtle distributional differences between real and model-generated time series and propose the contraction hypothesis, which states that model-generated time series, unlike real ones, exhibit progressively decreasing uncertainty under recursive forecasting: their distributions become progressively more concentrated, leading to uncertainty contraction. We formally prove this hypothesis under theoretical assumptions on model behavior and time series structure, and we provide empirical validation of the hypothesis across diverse datasets. Building on this insight, we introduce the Uncertainty Contraction Estimator (UCE), a white-box detector that aggregates uncertainty metrics over successive prefixes to identify TSLM-generated time series. Extensive experiments on 32 datasets show that UCE consistently outperforms state-of-the-art baselines, offering a reliable and generalizable solution for detecting model-generated time series.
