In reinforcement learning (RL), it is often advantageous to consider additional constraints on the action space to ensure safety or action relevance. Existing work on such action-constrained RL faces challenges regarding expressive policy updates, computational efficiency, and predictable runtime. Recent work proposes to use truncated normal distributions for stochastic policy gradient methods. However, the computation of key characteristics, such as the entropy, log-probability, and their gradients, becomes intractable under complex constraints. Hence, prior work approximates these using the non-truncated distributions, which severely degrades performance. We argue that accurate estimation of these characteristics is crucial in the action-constrained RL setting, and propose efficient numerical approximations for them. We also provide an efficient sampling strategy for truncated policy distributions and validate our approach on three benchmark environments, demonstrating significant performance improvements from accurate estimation.
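The abstract does not spell out the numerical approximations, but the following minimal NumPy sketch illustrates one plausible realization of the underlying quantities: the truncated-Gaussian normalizer Z = P(a ∈ C) under the base normal is estimated by Monte Carlo, the log-probability and entropy follow from it, and actions are drawn by rejection sampling against the constraint set. The unit-ball constraint, function names, and sample sizes are illustrative assumptions, not the method presented in the talk.

```python
import numpy as np

def in_constraint_set(actions):
    """Hypothetical constraint set C: actions inside the unit L2 ball (assumption)."""
    return np.linalg.norm(actions, axis=-1) <= 1.0

def mc_log_normalizer(mu, sigma, n_samples=20_000, rng=None):
    """Monte Carlo estimate of log Z, where Z = P(a in C) under N(mu, diag(sigma^2))."""
    rng = rng or np.random.default_rng(0)
    samples = rng.normal(mu, sigma, size=(n_samples, mu.shape[-1]))
    return np.log(np.mean(in_constraint_set(samples)) + 1e-12)

def truncated_log_prob(action, mu, sigma, log_z):
    """Log-density of the truncated Gaussian: untruncated log-density minus log Z."""
    d = mu.shape[-1]
    base = (-0.5 * np.sum(((action - mu) / sigma) ** 2)
            - np.sum(np.log(sigma)) - 0.5 * d * np.log(2.0 * np.pi))
    return base - log_z

def rejection_sample(mu, sigma, rng=None, max_tries=10_000):
    """Draw one action from the truncated policy by resampling until C is satisfied."""
    rng = rng or np.random.default_rng(1)
    for _ in range(max_tries):
        a = rng.normal(mu, sigma)
        if in_constraint_set(a[None, :])[0]:
            return a
    raise RuntimeError("constraint set has negligible mass under the base Gaussian")

def mc_entropy(mu, sigma, log_z, n_samples=512, rng=None):
    """Monte Carlo entropy estimate: mean negative log-density over truncated samples."""
    rng = rng or np.random.default_rng(2)
    samples = [rejection_sample(mu, sigma, rng) for _ in range(n_samples)]
    log_probs = np.array([truncated_log_prob(a, mu, sigma, log_z) for a in samples])
    return -np.mean(log_probs)

mu, sigma = np.array([0.2, -0.1]), np.array([0.5, 0.5])
log_z = mc_log_normalizer(mu, sigma)
action = rejection_sample(mu, sigma)
print("log-prob:", truncated_log_prob(action, mu, sigma, log_z))
print("entropy :", mc_entropy(mu, sigma, log_z))
```

Note that in an actual policy-gradient update the estimate of log Z must also be differentiated with respect to the policy parameters (for example via reparameterized samples), which this plain NumPy sketch does not attempt.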
