Singapore

Recent advances in learnable reward shaping have shown promise in single-agent reinforcement learning by automatically discovering effective feedback signals. However, the effectiveness of decentralized learnable reward shaping in cooperative multi-agent settings remains poorly understood. We propose DMARL-RSA, a fully decentralized system where each agent learns individual reward shaping, and evaluate it on cooperative navigation tasks in the simple\_spread\_v3 environment. Despite sophisticated reward learning, DMARL-RSA achieves only $-24.20 \pm 0.09$ average reward, compared to MAPPO with centralized training at $1.92 \pm 0.87$—a 26.12-point gap. DMARL-RSA performs similarly to simple independent learning (IPPO: $-23.19 \pm 0.96$), indicating that advanced reward shaping cannot overcome fundamental decentralized coordination limitations. Interestingly, decentralized methods achieve higher landmark coverage ($0.888 \pm 0.029$ for DMARL-RSA, $0.960 \pm 0.045$ for IPPO out of 3 total) but worse overall performance than centralized MAPPO ($0.273 \pm 0.008$ landmark coverage)—revealing a coordination paradox between local optimization and global performance. Analysis identifies three critical barriers: (1) non-stationarity from concurrent policy updates, (2) exponential credit assignment complexity, and (3) misalignment between individual reward optimization and global objectives. These results establish empirical limits for decentralized reward learning and underscore the necessity of centralized coordination for effective multi-agent cooperation.

AAAI 2026

On the Fundamental Limitations of Decentralized Learnable Reward Shaping in Cooperative Multi-Agent Reinforcement Learning

workshop paper
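
As a quick sanity check on the numbers quoted in the abstract above, the reported gaps can be recomputed from the mean rewards alone. This is only a minimal sketch: the dictionary and the `gap` helper are illustrative names, not part of the paper, and the ± spreads are ignored.

```python
# Mean rewards as reported in the abstract (spreads after "±" are not used here).
reported_mean_reward = {
    "DMARL-RSA": -24.20,
    "IPPO": -23.19,
    "MAPPO": 1.92,
}


def gap(better: str, worse: str) -> float:
    """Difference in mean reward between two methods, rounded to two decimals."""
    return round(reported_mean_reward[better] - reported_mean_reward[worse], 2)


# Centralized training (MAPPO) versus decentralized reward shaping (DMARL-RSA).
mappo_vs_dmarl = gap("MAPPO", "DMARL-RSA")
# Plain independent learning (IPPO) versus DMARL-RSA.
ippo_vs_dmarl = gap("IPPO", "DMARL-RSA")

print(f"MAPPO - DMARL-RSA: {mappo_vs_dmarl}")
print(f"IPPO  - DMARL-RSA: {ippo_vs_dmarl}")
```

The first difference reproduces the 26.12-point gap stated in the abstract; the second is roughly one point, which is consistent with the claim that DMARL-RSA performs similarly to IPPO.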

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held at the Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.




Recently, multi-robot systems have gained significant attention for their promise of scalable efficiency, reliability, and cost savings. A crucial capability is collaborative transportation, where a team of robots works together to transport a payload. However, key challenges remain, such as potential conflicts between team-level decisions and individual-level robot controls, team kinematic constraints imposed by the robot-payload coupling, and diverse obstacles encountered in 3D terrain. We present Collaborative Quadruped Transportation with Constrained Diffusion (CQTD), enabling a team of closely coupled quadruped robots to collaboratively transport a payload across 3D terrain. A diffusion-based upper level learns terrain-aware team-level trajectories satisfying team kinematic constraints due to the payload coupling, while a lower level optimizes velocity controls of individual robots satisfying collision and anisotropic velocity constraints. Experiments in high-fidelity simulations and on real-world quadruped robot teams demonstrate that CQTD outperforms baseline methods in challenging 3D terrain scenarios requiring closely-coupled collaboration between the quadruped robots.

Collaborative Quadruped Transportation in 3D Terrain with Constrained Diffusion

Recent advances in multi-agent reinforcement learning (MARL) have demonstrated success in numerous challenging domains and environments, but typically require specialized models for each task. In this work, we propose a coherent methodology that makes it possible for a single GPT-based model to learn and perform well across diverse MARL environments and tasks, including collision-avoidance and coordination problems (such as multi-agent path finding scenarios demonstrated in POGEMA), alongside established benchmarks like StarCraft Multi-Agent Challenge and Google Research Football. Our method, MARL-GPT, applies offline reinforcement learning to train at scale on expert trajectories (400M for SMACv2, 100M for GRF, and 1B for POGEMA) combined with a single transformer-based observation encoder that requires no task-specific tuning. By leveraging offline RL, we address the long-horizon planning and coordination challenges inherent in MAPF-like problems, enabling efficient learning without costly online environment interaction. Experiments show that MARL-GPT achieves competitive performance compared to specialized baselines in all tested environments. Thus, our findings suggest that it is, indeed, possible to build a multi-task transformer-based model for a wide variety of (significantly different) multi-agent problems paving the way to the fundamental MARL model (akin to ChatGPT, Llama, Mistral etc. in natural language modeling).

MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

Large language model (LLM) agents have shown increasing promise for collaborative task completion. However, existing multi-agent frameworks often rely on static workflows, fixed roles, and limited inter-agent communication, reducing their effectiveness in open-ended, high-complexity domains. This paper proposes a coordination framework that enables adaptiveness through three core mechanisms: dynamic task routing, bidirectional feedback, and parallel agent evaluation. The framework allows agents to reallocate tasks based on confidence and workload, exchange structured critiques to iteratively improve outputs, and—crucially—compete on high-ambiguity subtasks with evaluator-driven selection of the most suitable result. We instantiate these principles in a modular architecture and demonstrate substantial improvements in factual coverage, coherence, and efficiency over static and partially adaptive baselines. Our findings highlight the benefits of incorporating both adaptiveness and structured competition in multi-agent LLM systems.

Parallelism Meets Adaptiveness: Scalable Documents Understanding in Multi-Agent LLM Systems

Priority Inheritance and Backtracking (PIBT) has demonstrated significant success as a heuristic for large-scale Multi-Agent Path Finding (MAPF), but its application has been limited to scenarios where all robots move at uniform speeds. This work addresses the critical question of extending PIBT's capabilities to heterogeneous fleets, where robots may vary in size and speed. We introduce a novel collision-checking-based reservation system integrated within the WinPIBT framework and a new backtracking mechanism. Heterogeneous PIBT (HetPIBT) effectively scales to hundreds of robots of various sizes. Our findings confirm that this scheme enables the application of heterogeneous MAPF to large robot populations. To facilitate further research, we also provide a new set of heterogeneous MAPF benchmark scenarios along with a Python package for the generation of such problems and visualization tools. However, we observe that HetPIBT, similar to its homogeneous counterpart, can yield suboptimal solutions. We provide an open source Rust package which users can build upon.

Extending PiBT to support Heterogenous Robot Fleets

Multi-Agent Path Finding (MAPF) is a one-shot problem of finding collision-free paths in a shared environment while minimizing the sum of the agents' travel times.
Since solving MAPF optimally is NP-hard, $w$-optimal algorithms such as Explicit Estimation Conflict-Based Search (EECBS) have been used to speed up the search while providing a guarantee on the solution quality.
However, the scalability of EECBS is limited in large-scale MAPF instances.
While EECBS can be accelerated for regularly structured environments, such as Kiva warehouses, by utilizing specialized guidance heuristics, these heuristics are ineffective in more general and large-scale environments.
To fill this gap, we propose the \textit{Flow-Based Guidance Framework}, a general two-phase process that simulates a list of paths and then generates the \textit{Flow-Based Guidance Heuristic} (FH) without making prior assumptions about the environment's structure.
We identify features that distinguish $w$-optimal MAPF from other MAPF variants and propose strategies to enhance its effectiveness for guidance, complemented by the flex distribution technique from EECBS.
The empirical evaluation demonstrates that our FH significantly reduces collisions, thereby achieving higher success rates than the state-of-the-art within 60 seconds.

Flow-Based Guidance Framework with Flex Distribution for $w$-Optimal Multi-Agent Path Finding
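
The problem statement in the abstract above (collision-free paths, sum of travel times, and a $w$-optimality guarantee) can be made concrete with a small sketch. Everything here is an illustrative assumption rather than the paper's method: paths are lists of grid cells indexed by timestep, agents are assumed to wait at their goal after arriving, conflicts are the usual vertex and swap conflicts, and the function names are hypothetical.

```python
from itertools import combinations

Cell = tuple[int, int]
Path = list[Cell]  # path[t] is the cell an agent occupies at timestep t


def position(path: Path, t: int) -> Cell:
    """Cell occupied at time t, assuming the agent waits at its goal after arriving."""
    return path[min(t, len(path) - 1)]


def find_conflicts(paths: list[Path]) -> list[tuple[int, int, int, str]]:
    """Return (agent_i, agent_j, timestep, kind) for every vertex or swap conflict."""
    conflicts = []
    horizon = max(len(p) for p in paths)
    for (i, a), (j, b) in combinations(enumerate(paths), 2):
        for t in range(horizon):
            if position(a, t) == position(b, t):
                conflicts.append((i, j, t, "vertex"))
            elif (
                t > 0
                and position(a, t) == position(b, t - 1)
                and position(a, t - 1) == position(b, t)
            ):
                conflicts.append((i, j, t, "swap"))
    return conflicts


def sum_of_costs(paths: list[Path]) -> int:
    """Sum of the agents' travel times (number of moves until each reaches its goal)."""
    return sum(len(p) - 1 for p in paths)


def within_bound(cost: int, lower_bound: int, w: float) -> bool:
    """Sufficient check for w-optimality: cost <= w * (a lower bound on the optimum)."""
    return cost <= w * lower_bound


# Two agents crossing the same corridor head-on collide in the middle cell.
head_on = [[(0, 0), (1, 0), (2, 0)], [(2, 0), (1, 0), (0, 0)]]
# Two agents on parallel rows never meet.
parallel = [[(0, 0), (1, 0), (2, 0)], [(2, 1), (1, 1), (0, 1)]]

print(find_conflicts(head_on))
print(find_conflicts(parallel), sum_of_costs(parallel))
```

Because the true optimum is usually unknown, `within_bound` compares against a lower bound such as the sum of the individual shortest-path lengths; any cost within a factor $w$ of that lower bound is also within a factor $w$ of the optimum.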

State-of-the-art multi-robot kinodynamic motion planners struggle to handle more than a few robots due to high computational burden, which limits their scalability and results in slow planning time.
In this work, we combine the scalability and speed of modern multi-agent path finding (MAPF) algorithms with the dynamic-awareness of kinodynamic planners to address these limitations.
To this end, we propose discontinuity-Bounded LaCAM (db-LaCAM), a planner that utilizes a precomputed set of motion primitives that respect robot dynamics to generate horizon-length motion sequences, while allowing a user-defined discontinuity between successive motions.
The planner db-LaCAM is resolution-complete with respect to motion primitives and supports arbitrary robot dynamics.
Extensive experiments demonstrate that db-LaCAM scales efficiently to scenarios with up to $50$ robots, achieving up to ten times faster runtime compared to state-of-the-art planners, while maintaining comparable solution quality.
The approach is validated in both 2D and 3D environments with dynamics such as the unicycle and 3D double integrator.
We demonstrate the safe execution of trajectories planned with db-LaCAM in two distinct physical experiments involving teams of flying robots and car-with-trailer robots.

db-LaCAM: Fast and Scalable Multi-Robot Kinodynamic Motion Planning with Discontinuity-Bounded Search and Lightweight MAPF
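
The idea of allowing a user-defined discontinuity between successive motions, mentioned in the abstract above, can be illustrated with a toy check. This is a sketch under strong assumptions and not the planner itself: states are reduced to 2D positions, the discontinuity is measured with Euclidean distance, and `max_discontinuity`, `is_delta_bounded`, and the sample motions are hypothetical.

```python
import math

State = tuple[float, float]    # assumption: position-only 2D state
Motion = tuple[State, State]   # (start state, end state) of one motion primitive


def max_discontinuity(motions: list[Motion]) -> float:
    """Largest jump between one motion's end state and the next motion's start state."""
    jumps = (
        math.dist(prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(motions, motions[1:])
    )
    return max(jumps, default=0.0)


def is_delta_bounded(motions: list[Motion], delta: float) -> bool:
    """True if every jump between successive motions stays within the bound delta."""
    return max_discontinuity(motions) <= delta


# Three chained primitives with jumps of 0.25 and 0.5 between them.
sequence = [
    ((0.0, 0.0), (1.0, 0.0)),
    ((1.25, 0.0), (2.0, 0.0)),
    ((2.0, 0.5), (3.0, 0.0)),
]

print(max_discontinuity(sequence))
print(is_delta_bounded(sequence, delta=0.6), is_delta_bounded(sequence, delta=0.4))
```

A larger `delta` accepts more primitive sequences and so makes the search easier, at the price of trajectories that deviate more from exact dynamics; a real planner would measure the jump in the robot's full state space rather than in position only.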

Multi-Agent Path Finding (MAPF) is a fundamental problem in robotics and logistics, where multiple agents must reach their goals without collisions. While deep Multi-Agent Reinforcement Learning methods have recently shown impressive scalability and adaptability, their black-box nature hinders interpretability and trust—crucial aspects for deployment in real-world systems. In this work, we propose an interpretable policy distillation framework for MAPF. We first formulate MAPF as a stochastic game and execute a trained neural policy across diverse environments to build a large dataset of state–action pairs. We then distill this neural policy into a decision tree model that captures its underlying decision rules while maintaining strong performance. Through extensive evaluation, we analyze the trade-off between interpretability and performance, demonstrating that our distilled models achieve high fidelity to the original policy while providing transparent, human-understandable reasoning about agent behavior.

Interpretable Multi-Agent Path Finding via Decision Tree Extraction from Neural Policies

Multi-Agent Path Finding (MAPF) is a representative multi-agent coordination problem, where multiple agents are required to navigate to their respective goals without collisions. Solving MAPF optimally is known to be NP-hard, leading to the adoption of learning-based approaches to alleviate the online computational burden. Prevailing approaches, such as Graph Neural Networks (GNNs), are typically constrained to *pairwise* message passing between agents. However, this limitation leads to suboptimal behaviours and critical issues, such as attention dilution, particularly in dense environments where group (i.e. beyond just two agents) coordination is most critical. Despite the importance of such higher-order interactions, existing approaches have not been able to fully explore them. To address this representational bottleneck, we introduce HMAGAT (Hypergraph Multi-Agent Attention Network), a novel architecture that leverages attentional mechanisms over directed hypergraphs to explicitly capture group dynamics. Empirically, HMAGAT establishes a new state-of-the-art among learning-based MAPF solvers: e.g., despite having just 1M parameters and being trained on 100$\times$ less data, it outperforms the current SoTA 85M parameter model. Through detailed analysis of HMAGAT's attention values, we demonstrate how hypergraph representations mitigate the attention dilution inherent in GNNs and capture complex interactions where pairwise methods fail. Our results illustrate that appropriate inductive biases are often more critical than the training data size or sheer parameter count for multi-agent problems.

Pairwise is Not Enough: Hypergraph Neural Networks for Multi-Agent Pathfinding

Federated learning (FL) enables clients to collaboratively train a shared model in a distributed manner, setting it apart from traditional deep learning paradigms. However, most existing FL research assumes consistent client participation, overlooking the practical scenario of dynamic participation (DPFL), where clients may intermittently join or leave during training. Moreover, no existing benchmarking framework systematically supports the study of DPFL-specific challenges. In this work, we present the first open-source framework explicitly designed for benchmarking FL models under dynamic client participation. Our framework provides configurable data distributions, participation patterns, and evaluation metrics tailored to DPFL scenarios. Using this platform, we benchmark four major categories of widely adopted FL models and uncover substantial performance degradation under dynamic participation. To address these challenges, we further propose Knowledge-Pool Federated Learning (KPFL), a generic plugin that maintains a shared knowledge pool across both active and idle clients. KPFL leverages dual-age and data-bias weighting, combined with generative knowledge distillation, to mitigate instability and prevent knowledge loss. Extensive experiments demonstrate the significant impact of dynamic participation on FL performance and the effectiveness of KPFL in improving model robustness and generalization.

Dynamic Participation in Federated Learning: Benchmarks and a Knowledge Pool Plugin

The orbital environment is increasingly congested, heightening collision risk and demanding robust Space Situational Awareness (SSA). Ground-based tracking and centralized learning face latency, fragmented datasets, and strict privacy limits on telemetry sharing. While ML aids orbital prediction, purely data-driven models fail under sparse or irregular observations; Physics-Informed Neural Networks (PINNs) embed dynamics for physical consistency. However, locally trained PINNs lack shared context, fragmenting awareness. Collaboration is often constrained by privacy policy, export controls, or mission secrecy—sometimes forcing purely local learning and leaving blind spots. This motivates Federated Learning (FL), where satellites share model updates (not raw data) to refine physics-consistent predictors while preserving data autonomy. However, single-server FL is ill-suited for orbital networks, as it creates a single point of failure, over-smooths data, and exposes vulnerability to link outages. We therefore propose Graph-Decentralized Federated Learning (Graph-DFL) for multi-satellite SSA: a serverless framework where satellites exchange quantized incremental updates only with neighbors and reach consensus via topology-aware diffusion. Each Low Earth Orbit (LEO) client trains a GRU-based deinterleaver and a local PINN, while Medium Earth Orbit (MEO) relays apply confidence-weighted Cayley–Menger × Light-Cone (CM × LC) fusion. Experiments on real Two-Line Element (TLE)–derived SGP4 trajectories show that Graph-DFL achieves high deinterleaver accuracy and low trajectory RMSE, indicating a resilient, physics-consistent, privacy-preserving solution for SSA without centralization.
