Singapore

Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller large language models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving of meta-reasoning, which selects the appropriate sub-problem from multiple candidates, and solving, which addresses that sub-problem. This means that authentic reasoning has an implicit multi-branch structure. SFT collapses this rich structure into flat token-by-token prediction of the teacher's reasoning path, and therefore cannot distill the structure to the student. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). GSRM converts a reasoning path into multiple meta-reasoning-solving steps and assigns a reward that measures the alignment between the reasoning structures of student and teacher. RLKD combines this reward with RL, enabling the student LLM to internalize the teacher's implicit multi-branch structure in authentic reasoning rather than merely mimicking the teacher's fixed output paths. Experiments show that RLKD, even when trained on only 0.1% of the data under an RL-only regime, surpasses standard SFT-RL pipelines and unlocks more of the student LLM's potential reasoning ability than SFT-based distillation. Code is in supplemental material and will be released after review.

AAAI 2026

RLKD: Distilling LLMs’ Reasoning via Reinforcement Learning

large language models

reasoning

reinforcement learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Video Large Language Models (Video-LLMs) have demonstrated significant potential in video captioning, search, and summarization. However, current Video-LLMs still struggle with long real-world videos. Recent methods have introduced a retrieval mechanism that fetches query-relevant KV caches for question answering, improving both efficiency and accuracy on long real-world videos. However, the compression and retrieval of KV caches remain underexplored. In this paper, we propose StreamKV, a training-free framework that seamlessly equips Video-LLMs with advanced KV cache retrieval and compression. Unlike previous methods that use uniform partitioning, StreamKV dynamically partitions video streams into semantic segments, which better preserves semantic information. For KV cache retrieval, StreamKV calculates a summary vector for each segment to retain the segment-level information essential for retrieval. For KV cache compression, StreamKV introduces a guidance prompt designed to capture the key semantic elements within each segment, ensuring that only the most informative KV caches are retained for answering questions. Moreover, StreamKV unifies KV cache retrieval and compression within a single module and performs both in a layer-adaptive manner, further improving the effectiveness of streaming video question answering. Extensive experiments on public StreamingVQA benchmarks demonstrate that StreamKV significantly outperforms existing online Video-LLMs, achieving superior accuracy while substantially improving both memory efficiency and computational latency. The code will be released.

StreamKV: Streaming Video Question-Answering with Segment-based KV Cache Retrieval and Compression

Predicting spatiotemporal fields governed by partial differential equations (PDEs) from sparse sensor data is a critical and long-standing challenge in science and engineering. Recent deep learning approaches, particularly neural operators, have shown considerable promise in solving PDEs. However, their performance degrades significantly in the demanding regime of extreme sparsity, characterized by spatial sensor coverage of less than 1% and limited temporal observations. To overcome this limitation, we propose a novel framework that decouples the task into two stages: spatial reconstruction and temporal extrapolation. In the first stage, rather than reconstructing the high-dimensional physical field directly, our model learns to reconstruct the complete latent features from sparse observations—features that would otherwise be extracted from a dense field. This process is stabilized by a Vector Quantization (VQ) bottleneck, which discretizes the latent space. In the second stage, a decoder-only Transformer performs temporal extrapolation by autoregressively predicting the future sequence of these discrete latent indices. This design inherently allows the model to generalize to new initial conditions and varying forecast horizons, akin to standard autoregressive models. We validate our framework on three challenging benchmarks, achieving state-of-the-art (SOTA) performance under severe sparsity constraints. Furthermore, we introduce a challenging benchmark dataset based on fire dynamics simulations. On this benchmark, our model successfully forecasts the field's evolution 30 frames into the future from a single timeframe with less than 0.1% spatial observations—a result that pushes well beyond the capabilities of existing methods.
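To make the Vector Quantization bottleneck in the abstract above concrete, here is a minimal sketch of the lookup step only: each continuous latent vector is replaced by the index of its nearest codebook entry, and the resulting index sequence is what an autoregressive model would predict. The function name `quantize`, the toy codebook, and the squared-L2 distance are assumptions for illustration; the abstract does not specify the paper's actual encoder, codebook size, or training losses.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry.

    Distance is squared Euclidean; this is an illustrative choice,
    not a detail confirmed by the abstract.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda i: sqdist(z, codebook[i]))
        for z in latents
    ]


# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical latent features reconstructed for three spatial patches.
latents = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

indices = quantize(latents, codebook)
print(indices)  # discrete tokens a decoder-only Transformer could extrapolate
```

Discretizing the latent space this way turns forecasting into next-token prediction over codebook indices, which is why the second stage can reuse a standard decoder-only Transformer.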

Decoupled Spatiotemporal Forecasting from Extreme Sparse Observations via Quantized Latent Space

Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.

LexChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis

We present $\textsf{ModularSubsetSelection}$ (MSS), a new algorithm for locally differentially private (LDP) frequency estimation. 
Given a universe of size $k$ and $n$ users, our $\varepsilon$-LDP mechanism encodes each input via a Residue Number System (RNS) over $\ell$ pairwise-coprime moduli $m_0, \ldots, m_{\ell-1}$, and reports a randomly chosen index $j \in [\ell]$ along with the perturbed residue using the statistically optimal $\textsf{SubsetSelection}$ (SS) (Wang et al. 2016).
This design reduces the user communication cost from $\Theta\bigl(\omega \log_2(k/\omega)\bigr)$ bits required by standard SS (with $\omega \approx k/(e^\varepsilon+1)$) down to $\lceil \log_2 \ell \rceil + \lceil \log_2 m_j \rceil$ bits, where $m_j < k$.
Server-side decoding runs in $\Theta(n + r k \ell)$ time, where $r$ is the number of LSMR (Fong and Saunders 2011) iterations. 
In practice, with well-conditioned moduli (i.e., constant $r$ and $\ell = \Theta(\log k)$), this becomes $\Theta(n + k \log k)$. 
We prove that MSS achieves worst-case MSE within a constant factor of state-of-the-art protocols such as SS and $\textsf{ProjectiveGeometryResponse}$ (PGR) (Feldman et al. 2022), while avoiding the algebraic prerequisites and dynamic-programming decoder required by PGR.
Empirically, MSS matches the estimation accuracy of SS, PGR, and $\textsf{RAPPOR}$ (Erlingsson, Pihur, and Korolova 2014) across realistic $(k, \varepsilon)$ settings, while offering faster decoding than PGR and shorter user messages than SS.
Lastly, by sampling from multiple moduli and reporting only a single perturbed residue, MSS significantly reduces a Bayesian attacker’s reconstruction accuracy compared to SS.
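The Residue Number System encoding and the message-size formula from the abstract above can be illustrated with a small sketch. It covers only the noise-free parts: computing residues over pairwise-coprime moduli, counting the bits needed for an index and a residue, and recovering the input from all residues via the Chinese Remainder Theorem. The SubsetSelection perturbation and the LSMR-based server decoding are deliberately omitted, and the moduli, the input value, and all function names are hypothetical.

```python
from math import ceil, gcd, log2, prod


def rns_encode(x, moduli):
    """Return the residue of x modulo each modulus."""
    return [x % m for m in moduli]


def report_bits(num_moduli, m_j):
    """Bits for one report: ceil(log2 l) for the index plus ceil(log2 m_j) for the residue."""
    return ceil(log2(num_moduli)) + ceil(log2(m_j))


def crt_decode(residues, moduli):
    """Recover x in [0, prod(moduli)) from all residues (Chinese Remainder Theorem)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse (Python 3.8+)
    return x % M


# Hypothetical setup: universe size k = 1000, moduli with product 1001 >= k.
moduli = [7, 11, 13]
assert all(gcd(a, b) == 1 for i, a in enumerate(moduli) for b in moduli[i + 1:])

x = 123
residues = rns_encode(x, moduli)
print(residues)                             # [4, 2, 6]
print(crt_decode(residues, moduli))         # 123
print(report_bits(len(moduli), moduli[2]))  # 6 bits when reporting the residue mod 13
```

In the mechanism described in the abstract, each user sends only one index and one perturbed residue, so the server cannot apply the Chinese Remainder Theorem per user; the decoding shown here only demonstrates that the residues jointly identify the input.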

Private Frequency Estimation via Residue Number Systems

Multi-agent reinforcement learning (MARL) excels in cooperative and competitive tasks, but most architectures are tied to fixed input-output sizes and require retraining when the number of perceptible or controllable objects changes. While structural generalization techniques mitigate this, they rely on centralized training, raising concerns about scalability and privacy. We propose ADAPT, the first framework to support structural generalization under a decentralized training and decentralized execution (DTDE) paradigm. Every agent adopts an object-centric view, encoding each observed object into a feature vector and aggregating them into a variable-length set representation. To enable each agent to infer task-level contexts from this dynamic input independently, we propose a dynamic-consistency loss that enforces spatio-temporal alignment between context representations and observed environmental dynamics. Agents then condition their policies on the inferred contexts to make locally aligned decisions. For zero-shot transfer, we propose FINE (Foresight INdex for multi-agEnt), a metric that considers Q-value overestimation and enables cross-policy comparison of long-term impact, facilitating effective policy transfer. Experiments show that ADAPT surpasses existing DTDE methods and outperforms CTDE baselines in zero-shot generalization.

ADAPT: Adaptive Decentralized Architecture with Perception-Aligned Training for Structural Generalization in Multi-Agent RL

Plan verification is the task of checking whether a proposed plan correctly solves a given planning problem. In Hierarchical Task Network (HTN) planning, this verification problem is known to be NP-complete, even in the grounded setting. Existing approaches to HTN plan verification range from SAT encodings to parser-based techniques. However, the temporal structure that emerges from hierarchical decomposition has not yet been used explicitly as a basis for verification.
In this paper, we establish a formal connection between HTN planning and temporal reasoning by showing how decomposition structures can be naturally represented using qualitative constraint networks. Building on this insight, we present a new top-down encoding that transforms the verification of partially ordered task networks into a temporal reasoning problem. We prove the correctness of this encoding and explain how it accounts for both the hierarchical and temporal aspects of HTN plans.
By linking HTN plan verification with qualitative temporal reasoning, our approach introduces a principled formal framework for reasoning about complex temporal relationships in hierarchical plans. This connection offers new perspectives for knowledge representation in structured planning domains.

HTN Plan Verification by Qualitative Temporal Reasoning

We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks.
DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic GitHub problems across ten widely used Python data science libraries.
DSCodeBench offers a more challenging and representative testbed, more complex code solutions, more comprehensive data science libraries, clearer and better structured problem descriptions, and stronger test suites.
To construct DSCodeBench, we develop a robust pipeline that combines task scope selection, code construction, test case generation, and problem description synthesis.
The process is paired with rigorous manual editing to ensure alignment and enhance the reliability of the evaluation.
Experimental results show that DSCodeBench exhibits robust scaling behavior, where larger models systematically outperform smaller ones, validating its ability to distinguish model capabilities.
The best LLM we test, GPT-4o, has a pass@1 of 0.392, indicating that LLMs still have substantial room for improvement on realistic data science code generation tasks.
We believe DSCodeBench will serve as a rigorous and trustworthy foundation for advancing LLM-based data science programming.
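For readers unfamiliar with the pass@1 figure quoted above, the sketch below shows the widely used unbiased pass@k estimator for a single problem, given n sampled solutions of which c pass the tests. Whether DSCodeBench computes its score with this estimator, and with how many samples per problem, is not stated in the abstract, so treat this as background rather than the benchmark's scoring code. The sample counts are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: solutions sampled, c: solutions that pass all tests, k: budget.
    """
    if n - c < k:
        # Every size-k subset contains at least one passing solution.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 4 of them pass.
print(pass_at_k(10, 4, 1))  # 0.4, which equals c / n when k = 1
print(pass_at_k(10, 4, 2))
```

A benchmark-level score is then the mean of this quantity over all problems.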

DSCodeBench: A Realistic Benchmark for Data Science Code Generation

Learning rate scheduling is crucial for training large language models, yet understanding the optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments on both dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
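As a rough illustration of the Warmup-Steady-Decay schedule discussed in the abstract above, the sketch below computes the learning rate at a given step from the three quantities the abstract names: total training steps, maximum learning rate, and an annealing ratio (the fraction of training spent decaying). The linear warmup, the linear decay shape, and all parameter names are assumptions made for illustration; the abstract does not say which decay function the paper studies.

```python
def wsd_lr(step, total_steps, max_lr, warmup_steps, anneal_ratio, min_lr=0.0):
    """Learning rate at `step` under a warmup / steady / decay schedule.

    anneal_ratio is the fraction of total_steps used for the final decay.
    Linear warmup and linear decay are illustrative assumptions.
    """
    decay_steps = round(total_steps * anneal_ratio)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Warmup: ramp linearly up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Steady phase: hold the maximum learning rate.
        return max_lr
    # Decay phase: interpolate linearly from max_lr toward min_lr.
    progress = (step - decay_start) / max(decay_steps, 1)
    return max_lr + (min_lr - max_lr) * progress


# Hypothetical run: 1000 steps, 100 warmup steps, last 20% spent annealing.
for s in (0, 99, 500, 800, 900, 999):
    print(s, wsd_lr(s, total_steps=1000, max_lr=1.0, warmup_steps=100, anneal_ratio=0.2))
```

Under this parameterization, tuning the annealing strategy amounts to choosing `anneal_ratio`, which is the kind of quantity the abstract reports as transferable across configurations.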

Scaling and Transferability of Annealing Strategies in Large Language Model Training

Diffusion models have achieved impressive generative performance across diverse domains such as image, video, and scientific data generation. However, fine-tuning these models for new tasks remains challenging due to their large scale, architectural diversity, and high sensitivity to hyperparameters—particularly learning rates. In this work, we propose Wasserstein-Aware Transfer (WAT), a principled and effective fine-tuning strategy grounded in diffusion trajectory analysis and optimal transport theory. Our key insight is that the distributional discrepancies between diffusion trajectories from different datasets decrease progressively over time and converge near the noise end. Based on this observation, we introduce a class-wise matching mechanism that minimizes the Wasserstein distance between class distributions of source and target datasets. This enables alignment at the class level without modifying the standard fine-tuning pipeline. To further enhance knowledge retention, we propose a novel sampling strategy that linearly combines class-conditional outputs from both pretrained and fine-tuned models. This method is simple yet effective, requiring negligible computational overhead while preserving domain-specific and generalizable knowledge. Extensive experiments across seven diverse benchmarks demonstrate that WAT reliably enhances generation quality under distribution shifts, outperforming competitive baselines. These results underscore its robustness and affirm the potential of optimal transport as a principled basis for knowledge transfer in diffusion models.

Wasserstein-Aware Transfer: Class-Level Alignment for Robust Diffusion Model Adaptation

In a public goods game, every player chooses whether or not to produce a good that all neighboring players will have access to. We consider a setting in which the public good is indivisible, neighboring players are out-neighbors in a directed graph, and there is a capacity constraint $k$ on the number of out-neighbors that can benefit from the good. Each player therefore makes a two-pronged decision: whether or not to produce and, conditional on producing, which $k$ out-neighbors to share access with. We examine both pure and mixed Nash equilibria in the model from the perspective of existence, computation, and efficiency. We perform a comprehensive study of these three dimensions with respect to both the sharing capacity ($k$) and the network structure (the underlying directed graph), and establish sharp complexity dichotomies for each.
