Large reasoning models (LRMs) improve performance at test time by \emph{thinking longer}, but this often leads to overthinking and high computational cost. To address this, recent reinforcement learning (RL) methods adopt mechanical outcome-level rewards (e.g., rule- or prompt-based) that favor shorter correct paths but overlook the quality of intermediate reasoning. In contrast, dense supervision from process reward models (PRMs) has proven more effective at promoting coherent, high-quality reasoning. However, static PRM supervision introduces two challenges: \textit{reward hacking}, since fixed rewards poorly capture global reasoning objectives, and \textit{high training cost}, since dense reward labels are expensive to obtain at scale. To overcome these issues, we propose step group relative policy optimization (\textbf{Step-GRPO}), a GRPO-based method that integrates step-level PRM signals into sparse trajectory-level feedback, avoiding costly step-level supervision while enhancing reasoning quality beyond accuracy. In addition, Step-GRPO employs a step-attention mechanism that captures inter-step dependencies and emphasizes critical reasoning steps, effectively mitigating reward hacking. We apply Step-GRPO to train LLMs and achieve consistent gains in reasoning quality and accuracy alongside shorter reasoning traces across multiple math benchmarks, outperforming RL baselines at substantially lower cost. Notably, the proposed model achieves 36.7\% accuracy on AIME'24 with 11K samples and \$38 of training cost, surpassing \$1000+ baselines trained on 40K+ samples, demonstrating strong cost-effectiveness and scalability.
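The abstract does not give the exact formulation, but the core idea can be sketched as follows: step-level PRM scores are aggregated into a single trajectory-level reward via attention-style weights over steps, and those rewards are then standardized within each sampled group, as in GRPO. The function names, the softmax form of the step-attention, and all numbers below are illustrative assumptions, not the paper's definitions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def trajectory_reward(step_scores, attn_logits):
    """Aggregate step-level PRM scores into one sparse trajectory-level
    reward, weighting each step by a (hypothetical) step-attention
    distribution that emphasizes critical reasoning steps."""
    weights = softmax(attn_logits)
    return sum(w * s for w, s in zip(weights, step_scores))

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each trajectory's reward
    against the mean and std of its own sampled group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mu) / std for r in rewards]

# Example: a group of 3 sampled trajectories for one prompt.
# Each pair is (per-step PRM scores, per-step attention logits).
groups = [
    ([0.9, 0.8, 0.95], [2.0, 0.5, 1.0]),
    ([0.4, 0.3, 0.5],  [1.0, 1.0, 1.0]),
    ([0.7, 0.9, 0.2],  [0.2, 1.5, 0.1]),
]
rewards = [trajectory_reward(s, a) for s, a in groups]
advantages = group_relative_advantages(rewards)
```

Because the PRM scores are folded into a single scalar per trajectory before the group-relative step, the policy update only ever needs trajectory-level feedback, which is what lets the method avoid dense step-level supervision during training.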
