In cooperative multi-agent reinforcement learning (MARL), subgroup-wise learning assigns sub-tasks to agents in order to enhance team collaboration. However, existing methods rely on manually defined allocation criteria, which hinders prompt adaptation to environmental changes, and they relax communication restrictions, limiting the applicability of these algorithms across domains. To address these issues, we propose the Autonomous Partner Selection (APS) framework, which offers an implicit grouping mechanism that operates autonomously. During training, each agent autonomously selects cooperative partners and integrates its own observations with those of its partners to coordinate cooperative behaviour. To strictly restrict communication, the intention encoder is trained through information distillation, which enables agents to select more cooperative actions based solely on local observations. Meanwhile, to avoid potential conflicts caused by behavioural homogenisation, we apply a contrastive learning strategy to the cooperative intentions generated by agents, ensuring that the behavioural tendencies of different individuals remain as diverse as possible. Finally, we conduct extensive comparative experiments on the StarCraft Multi-Agent Challenge and Google Research Football. The results demonstrate that APS outperforms state-of-the-art algorithms across a range of tasks, and that agents can adapt their grouping strategies to the environment to achieve stronger cooperation.
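As a rough illustration of the diversity objective described above (a minimal sketch, not the paper's implementation: the function names, the cosine-similarity measure, and the exact log-sum-exp loss form are all assumptions), a contrastive-style penalty on agents' cooperative-intention vectors can be written so that it grows as different agents' intentions become more similar:

```python
import math

def cosine(u, v):
    # Cosine similarity between two intention vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_loss(intentions, temperature=0.5):
    """Toy contrastive diversity penalty over agents' intention vectors.

    For each agent i, every other agent j is treated as a negative, and the
    loss is the mean log-sum-exp of temperature-scaled similarities.
    Minimizing it pushes different agents' intentions apart.
    (Illustrative only; the actual APS objective may differ.)
    """
    n = len(intentions)
    total = 0.0
    for i in range(n):
        negs = [math.exp(cosine(intentions[i], intentions[j]) / temperature)
                for j in range(n) if j != i]
        total += math.log(sum(negs))
    return total / n
```

For two agents with identical intentions the penalty is higher than for two agents with orthogonal intentions, matching the stated goal of keeping behavioural tendencies as diverse as possible.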
