Singapore

Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks. Agentic Reinforcement Learning (Agentic RL) optimizes such models over full tool-interaction trajectories, but two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0–1 verifiable signals, provide limited guidance for intermediate steps and slow convergence; (2) Gradient degradation in Group Relative Policy Optimization (GRPO), where identical rewards within a rollout group yield zero advantage, reducing sample efficiency and destabilizing training. To address these challenges, we propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO). PRS is a curriculum-inspired reward design that introduces dense, stage-wise feedback, encouraging models to first master parseable and properly formatted tool calls, then optimize for factual correctness and answer quality. We instantiate PRS for short-form QA (with a length-aware BLEU to fairly score concise answers) and long-form QA (with LLM-as-a-Judge scoring to prevent reward hacking). VSPO is an enhanced GRPO variant that replaces low-value samples with prompts selected by a task-value metric balancing difficulty and uncertainty, and applies value-smoothing clipping to stabilize gradient updates. Experiments on multiple short-form and long-form QA benchmarks show that PRS consistently outperforms traditional binary rewards, and VSPO achieves superior stability, faster convergence, and higher final performance compared to PPO, GRPO, CISPO, and SFT-only baselines. Together, PRS and VSPO yield LLM-based TIR agents that generalize better across domains.
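The zero-advantage problem described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO-style normalization (reward minus group mean, divided by group standard deviation); the abstract does not give the paper's exact formula. The `staged_reward` function and its `format_bonus` weight are hypothetical stand-ins for a stage-wise reward, not the authors' PRS definition.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its rollout group (GRPO-style, assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def staged_reward(parseable, answer_score, format_bonus=0.2):
    """Hypothetical stage-wise reward: format credit first, then answer quality."""
    if not parseable:
        return 0.0
    return format_bonus + (1.0 - format_bonus) * answer_score

# Four rollouts that all fail the task: identical binary rewards carry no signal.
binary = [0.0, 0.0, 0.0, 0.0]
flat = group_relative_advantages(binary)

# The same rollouts under the staged reward: two produced parseable tool calls,
# and one of those was partially correct (illustrative values only).
rollouts = [(True, 0.0), (False, 0.0), (True, 0.3), (False, 0.0)]
shaped = [staged_reward(p, s) for p, s in rollouts]
dense = group_relative_advantages(shaped)
```

With identical rewards every advantage is zero, so the group contributes no gradient; once the reward distinguishes parseable from unparseable rollouts, the advantages differ and the group becomes informative again.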

AAAI 2026

Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.


We present a training-free graph-based approach for solving interactive reasoning tasks in the ARC-AGI-3 benchmark. ARC-AGI-3 comprises game-like tasks where agents must infer task mechanics through limited interactions, and adapt to increasing complexity as levels progress. Success requires forming hypotheses, testing them, and tracking discovered mechanics. The benchmark has revealed that state-of-the-art LLMs are currently incapable of reliably solving these tasks. Our method combines vision-based frame processing with systematic state-space exploration using graph-structured representations. It segments visual frames into meaningful components, prioritizes actions based on visual salience, and maintains a directed graph of explored states and transitions. By tracking visited states and tested actions, the agent prioritizes actions that provide the shortest path to untested state-action pairs. On the ARC-AGI-3 Preview Challenge, this structured exploration strategy solves a median of 30 out of 52 levels across six games and ranks 3rd on the private leaderboard, substantially outperforming frontier LLM-based agents. Beyond its standalone performance, our approach can be viewed as a symbolic external tool: on ARC-AGI-3, it provides an explicit state graph that compensates for current LLMs' failure to maintain and exploit such structure, suggesting a concrete way to augment LLM-based reasoners on interactive logical tasks. Taken together, these results indicate that explicit graph-structured exploration, even without learning, can serve as a strong baseline for interactive reasoning and underscore the importance of systematic state tracking and action prioritization in sparse-feedback environments.
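The idea of following the shortest path to an untested state-action pair can be sketched as a breadth-first search over the graph of already-explored transitions. This is a minimal illustration, not the authors' implementation: it assumes states are hashable identifiers (for example, frame hashes), that every action is available in every state, and it omits the visual-salience prioritization mentioned in the abstract. The name `path_to_untested` and the `transitions` layout are hypothetical.

```python
from collections import deque

def path_to_untested(start, transitions, actions):
    """Return the shortest action sequence from `start` that ends with an
    untested (state, action) pair, or None if everything reachable is tested.

    transitions: dict mapping (state, action) -> next_state for tested pairs.
    actions: list of all actions (assumed available in every state).
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        # An untested action in this state is the closest one: take it.
        for action in actions:
            if (state, action) not in transitions:
                return path + [action]
        # All actions tested here: expand along the known transitions.
        for action in actions:
            nxt = transitions[(state, action)]
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None

# Toy graph: s0 is fully tested, s1 still has "right" untested.
known = {("s0", "up"): "s0", ("s0", "right"): "s1", ("s1", "up"): "s2"}
plan = path_to_untested("s0", known, ["up", "right"])
```

Because breadth-first search visits states in order of distance, the first untested pair it finds is reachable with the fewest known transitions, which is the property the abstract describes.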

Graph-Based Exploration for ARC-AGI-3 Interactive Reasoning Tasks

Alzheimer's disease intervention planning requires both predictive modeling of biomarker trajectories and counterfactual reasoning about treatment timing. We propose a neuro-symbolic architecture that integrates Fourier Neural Operators (FNOs) for physics-informed biomarker prediction with Answer Set Programming (ASP) and SMT solving for verifiable intervention planning. Our approach adapts FNO methods to learn surrogate operators for AT(N) biomarker cascade dynamics, enabling fast multi-year trajectory forecasting while preserving cascade constraints. The symbolic layer formalizes clinical knowledge using first-order logic and temporal logic rules, allowing ASP/s(CASP) to generate candidate intervention strategies that are verified against safety properties using Z3 SMT solving. This combination provides both the predictive power of deep learning and the formal guarantees of symbolic reasoning, addressing critical translational challenges in precision medicine for Alzheimer's disease.

Neuro-Symbolic AI for Alzheimer's Disease: Physics-Informed Biomarker Prediction and Verifiable Intervention Planning

Rising provider turnover means that patients frequently need to be rematched with available providers. However, the rematching process is cumbersome for both patients and health systems, resulting in labor-intensive and ad hoc reassignments. We propose a novel patient-provider matching approach to address this issue by offering patients limited provider menus. The goal is to maximize match quality across the system while preserving patient choice. We frame this as a novel variant of assortment optimization, where patient-specific provider menus are offered upfront, and patients respond in a random sequence to make their selections. This hybrid offline-online setting is understudied in previous literature and captures system dynamics across various domains. We first demonstrate that a greedy baseline policy, which offers all providers to all patients, can maximize the match rate but lead to low-quality matches. Based on this, we construct a set of policies and demonstrate that the best policy depends on problem specifics, such as a patient's willingness to match and the ratio of patients to providers. On real-world data, our proposed policy improves average match quality by 13% over a greedy solution by tailoring assortments based on patient characteristics. Our analysis reveals a tradeoff between menu size and system-wide match quality, highlighting the value of balancing patient choice with centralized planning.

Assortment Optimization for Patient-Provider Matching

NP-hard scheduling problems are increasingly prevalent in our daily lives. Their inherent difficulty and larger scale constitute a real challenge to both the Artificial Intelligence and Operational Research communities, which continuously seek efficient approaches to solve these theoretical problems and their real-life variants. Existing approaches are either exact or heuristic. Exact algorithms, which generally tackle small, academic-type scheduling problems, have the property of identifying an optimal solution if one exists in the state space. Algorithms belonging to the exact class are either time-and-space limited or time-limited. Time-and-space limited algorithms, such as Branch-and-Bound algorithms, have exponential time and space complexities. That is, such algorithms can exhaust the computational memory for a given size of the input problem instance, further confirming their unsuitability for solving very large problem instances. Time-limited algorithms, such as Depth-First Search, have polynomial space complexity and exponential time complexity. While such algorithms are unlikely to exhaust computational memory, their runtime may be very long, which also restricts their suitability for large-scale problem instances.
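The time-limited class can be illustrated with a small depth-first search. The sketch below is an assumption-laden example, not the method of the presentation: it uses a hypothetical single-machine problem (minimize total tardiness of jobs given as `(processing_time, deadline)` pairs) and a simple cost-based prune. Memory stays linear in the number of jobs, while the number of explored orderings can grow factorially.

```python
def min_total_tardiness(jobs):
    """Depth-first search over job orders on one machine.

    jobs: list of (processing_time, deadline) pairs (hypothetical instance format).
    Returns the minimum total tardiness over all orderings.
    """
    n = len(jobs)
    best = [float("inf")]
    used = [False] * n  # O(n) memory: one flag per job plus the recursion stack

    def dfs(time, tardiness, placed):
        if tardiness >= best[0]:
            return  # prune: tardiness never decreases as jobs are appended
        if placed == n:
            best[0] = tardiness
            return
        for i in range(n):
            if not used[i]:
                used[i] = True
                processing, deadline = jobs[i]
                finish = time + processing
                dfs(finish, tardiness + max(0, finish - deadline), placed + 1)
                used[i] = False

    dfs(0, 0, 0)
    return best[0]

example = min_total_tardiness([(2, 2), (1, 1), (3, 6)])
```

The search never stores more than the current partial schedule, which is why memory is not the bottleneck; the exponential number of orderings is what limits it on large instances.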

F heuristic search to solve scheduling with deadlines problems

Mixed Binary Quadratic Programs (MBQPs) are an important and complex set of problems in combinatorial optimization. As solving large-scale combinatorial optimization problems is challenging, primal heuristics have been developed to quickly identify high-quality solutions within a short amount of time. Recently, a growing body of research has also used machine learning to accelerate solution methods for challenging combinatorial optimization problems. Despite the increasing popularity of these ML-guided methods, a large body of work has focused on Mixed-Integer Linear Programs (MILPs). MBQPs are challenging to solve due to the combinatorial complexity coupled with nonlinearities. This work proposes ML-guided primal heuristics for Mixed Binary Quadratic Programs (MBQPs) by adapting and extending existing work on ML-guided MILP solution prediction to MBQPs. We introduce a new neural network architecture for MBQP solution prediction and a new training data collection procedure. Moreover, we extend existing loss functions in solution prediction and propose to combine contrastive weighted cross-entropy losses. We evaluate the methods on standard and real-world MBQP benchmarks and show that the developed ML-guided methods significantly outperform existing primal heuristics and state-of-the-art solvers. Furthermore, models trained with our proposed extension with combined losses outperform other ML-based methods adapted from MILPs and improve generalization in cross-regional inference on a real-world wind farm layout optimization problem.

ML-Guided Primal Heuristics for Mixed Binary Quadratic Programs

To mitigate acute wildfire ignition risks, utilities de-energize power lines in high-risk areas. The Optimal Power Shutoff (OPS) problem optimizes line energization statuses to manage wildfire ignition risks through de-energizations while reducing load shedding. OPS problems are computationally challenging Mixed-Integer Linear Programs (MILPs) that must be solved rapidly and frequently in operational settings. For a particular power system, OPS instances share a common structure with varying parameters related to wildfire risks, loads, and renewable generation. This motivates the use of Machine Learning (ML) for solving OPS problems by exploiting shared patterns across instances. In this paper, we develop an ML-guided framework that quickly produces high-quality de-energization decisions by extending existing ML-guided MILP solution methods while integrating domain knowledge on the number of energized and de-energized lines. Results on a large-scale realistic California-based synthetic test system show that the proposed ML-guided method produces high-quality solutions faster than traditional optimization methods.

Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk

Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models with human preferences. However, when preference data is aggregated from diverse populations, it remains unclear whether the resulting aligned models serve all demographic groups equitably, or support long-term behavior change and mental health and wellness needs in a balanced way. We investigate this question through a controlled experiment using Direct Preference Optimization (DPO), training on preferences collected from our novel synthetic dataset, the 10th Village, comprising 5,000 synthetic villagers with demographics and personality traits modeled on U.S. Census data and validated psychological instruments. Each villager provided preferences across everyday stressors in financial/employment and social/relationship domains. We introduce an alignment fairness evaluation framework that treats RLHF as a behavior-aware recommendation problem, measuring how well the aligned model matches individual villager preferences compared to the base model and analyzing disparities across demographic subgroups. Our results reveal two critical sources of inequality: First, social and relationship problems receive substantially less benefit from alignment than financial concerns (p < .001), despite already generating higher baseline dissatisfaction. Second, more educated villagers gain disproportionate benefit from alignment (p < .001), particularly for social problems, creating a compounding advantage. These findings suggest that standard RLHF practices may systematically disadvantage certain problem domains and demographic groups, highlighting the need for fairness-aware approaches to preference aggregation and model alignment. 
Our contributions include both the 10th Village dataset and a reusable evaluation protocol for controlled, behavior-aware fairness research, as well as empirical evidence of disparate impact in preference-based alignment to guide the design of more equitable, wellness-oriented RLHF systems.

Not All Stress Is Treated Equal: Fairness Gaps in AI Support for Everyday Problems

As AI systems increasingly mediate human interactions, most alignment frameworks erroneously assume static human preferences. We introduce Socially-Aware Continual Learning (SCL), a framework that maintains ethical alignment with dynamically evolving norms through norm embeddings and Social Elastic Weight Consolidation (SEWC)—a novel algorithm that adapts regularization strength based on measured norm drift. Extensive experiments on longitudinal datasets demonstrate SCL’s superior balance between Alignment Stability (F1 = 0.84) and Normative Plasticity (F1 = 0.87), significantly outperforming state-of-the-art baselines (p < 0.001). Our contributions include: (1) validated drift detection metrics (BDI, DCS) achieving 0.89 F1-score, (2) human evaluations showing 82% trust recovery after norm shifts, (3) formalized fairness-utility trade-offs with 4% versus 20% disparity for baselines, and (4) societal-scale simulation showing 36% polarization reduction. SCL provides both theoretical stability guarantees and practical tools for developing socially responsive AI.

Socially-Aware Continual Learning: Modeling Dynamic Alignment with Evolving Human Norms

Cognitive behavioural therapy (CBT) and exposure therapy (ET) are among the most effective interventions for anxiety and related mental health conditions, yet their traditional delivery remains limited due to issues of scalability, personalisation, evaluation accuracy and constructive user engagement. This position paper proposes a framework for affect-intelligent virtual reality (AIVR), an AI-driven and ethically grounded approach that integrates immersive technology, physiological sensing and automated therapy adaptation to address the core limitations of conventional CBT. By leveraging real-time affective data and behaviour-aware modelling, AIVR systems can dynamically tailor exposure difficulty, provide personalised feedback and support, and offer interpretable feedback to both users and clinicians. The framework outlines how adaptive AI can (1) extend therapy beyond traditionally inefficient conversation-based formats, (2) streamline individualisation through continuous learning, (3) ensure safe and transparent evaluation, and (4) promote long-term behavioural resilience. Emphasising ethical design, transparency and human oversight, AIVR reimagines AI not as a replacement for therapists, but as a collaborative partner in mental health care. This paper calls for interdisciplinary collaboration across AI, human-computer interaction, psychology and ethics to realise trustworthy, behaviour-aware VR interventions that align therapeutic innovation with human values.

Affect-Intelligent Virtual Reality (AIVR): Overcoming the Core Challenges of Cognitive Behavioural Therapy

In this tutorial, we chart a practical path from raw capability to trustworthy reasoning with foundation models. We begin by motivating why trustworthy reasoning is essential: when models bluff multiplications or invent drug interactions, their value collapses and risks increase. We adopt four pillars of trustworthiness, i.e., capability, safety, robustness, and explainability, as the organizing framework for the entire session.

In Part I, we trace the evolution from early language models to today’s foundation models that produce extended chains of thought and act in the world. Through concrete case studies, we dissect jailbreaks, hallucinations, and brittle logic, and we connect these failure modes to regulatory pressure such as the EU AI Act. The takeaway is clear: we must design for trustworthy reasoning from the outset, especially in high-stakes domains such as clinical or financial decision-making.

In Part II, we move from leaderboards to a science of measurement. We show how to build reliable, valid evaluations using psychometric tools, including item response theory, amortized evaluation, and predictability analysis. We implement three open-source pipelines hands-on: TruthfulQA for hallucination detection, HellaSwag for robustness testing, and MATH with formal-verification hooks in Lean4. Along the way, we demonstrate red-teaming stress tests and reasoning-trace metrics that surface subtle errors leaderboards miss, and we practice calibration, dataset curation, and transparent reporting for honest progress tracking.
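As a pointer to what item response theory contributes to evaluation, here is a minimal sketch of the standard two-parameter logistic (2PL) item response function. The tutorial description does not say which IRT variant it uses, so the choice of 2PL and the parameter names are assumptions for illustration.

```python
import math

def p_correct(ability, difficulty, discrimination=1.0):
    """2PL item response function: probability that a model with the given
    latent ability answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# A model whose ability equals the item difficulty succeeds half the time.
even_odds = p_correct(0.5, 0.5)
```

Fitting `difficulty` and `discrimination` per benchmark item is what lets an evaluation separate "hard items" from "weak models" instead of reporting a single accuracy number.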

In Part III, we deliver a compact methodology for trustworthy machine reasoning. We cover training-free prompting methods (chain-of-thought, retrieval-augmented generation, constrained decoding), post-training algorithms (supervised fine-tuning, RLHF, verifiable rewards, self-reward), and test-time techniques (self-consistency, reflection, tree search, tool-augmented verification). We introduce guardrails—safe sampling and semantic filters—that reduce risk without crippling capability. For each technique, we map effects to the four pillars, highlight trade-offs and failure signatures, and summarize when to combine methods for maximum leverage.
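Of the test-time techniques listed above, self-consistency is the simplest to sketch: sample several reasoning chains, extract each final answer, and keep the most frequent one. The example below assumes the answers have already been extracted as strings; the normalization (strip and lowercase) and the function name are illustrative choices, not part of the tutorial materials.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over sampled final answers.

    Returns (winning_answer, agreement), where agreement is the fraction of
    samples that voted for the winner.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(a.strip().lower() for a in answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Four sampled chains, three of which agree after normalization.
winner, agreement = self_consistency(["42", " 42", "41", "42 "])
```

The agreement fraction doubles as a cheap confidence signal: low agreement is a hint that the question may need reflection, tree search, or tool-based verification instead.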

In Part IV, we turn to deployment. We walk through real-world agents and workflows, e.g., Lean4-based code verification assistants and bioinformatics pipelines proposing candidate compounds. We share step-by-step recipes, failure checklists, and diagnostics so participants can preserve trust while shipping. We also outline governance artifacts—risk registers, evaluation cards, and incident playbooks—that align technical practice with policy expectations.

We emphasize open, reproducible assets and decision rubrics that translate research into dependable products. Our goal is simple: help you move from compelling demos to trustworthy systems that earn and deserve user trust.

All materials will be available at: https://trustworthy-machine-reasoning.github.io/
