Singapore

The multi-path commodity flow problem (MPCFP) is crucial for ensuring reliable and high-speed data transmission in communication networks. However, existing studies that employ pre-generated routing paths neglect real-time load state and the coupling among decisions, thus hindering the achievement of high-quality solutions. To overcome this, we propose Hierarchical Reinforcement Learning with Topology-Aware Exploration (HRL-TAE), which is the first fully end-to-end framework that dynamically produces high-quality solutions based on real-time network states. HRL-TAE integrates an exploration mechanism and utilizes the State Transition Guiding List (STGL) to guide state transitions, thereby transforming topology exploration into a Markov decision process. Guided by STGL, two closely coupled layers in HRL-TAE, that is, the path construct layer and the ratio allocate layer, construct multiple subpaths for each flow and allocate traffic ratios among them. Subsequently, adaptive constraint-driven masks exclude infeasible actions during decision making, thereby guaranteeing that all constraints are satisfied. We also adopt a tailored training approach to obtain accurate gradient estimates and improve training efficiency. Simulations and real-world experiments demonstrate that HRL-TAE achieves superior performance.

AAAI 2026

Hierarchical Reinforcement Learning with Topology-Aware Exploration Framework for Multi-path Commodity Flow Problem

prs: routing

ml: reinforcement learning

prs: planning/scheduling and learning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Multi-behavior sequential recommendation aims to capture users' dynamic interests by modeling diverse types of user interactions over time. Although several studies have explored this setting, the recommendation performance remains suboptimal, mainly due to two fundamental challenges: the heterogeneity of user behaviors and data sparsity. To address these challenges, we propose BLADE, a framework that enhances multi-behavior modeling while mitigating data sparsity. Specifically, to handle behavior heterogeneity, we introduce a dual item-behavior fusion architecture that incorporates behavior information at both the input and intermediate levels, enabling preference modeling from multiple perspectives. To mitigate data sparsity, we design three behavior-level data augmentation methods that operate directly on behavior sequences rather than core item sequences. These methods generate diverse augmented views while preserving the semantic consistency of item sequences. These augmented views further enhance representation learning and generalization via contrastive learning. Extensive experiments on three real-world datasets demonstrate the effectiveness of our approach. The code is available at: \url{https://anonymous.4open.science/r/BLADE-B189/}.

BLADE: A Behavior-Level Data Augmentation Framework with Dual Fusion Modeling for Multi-Behavior Sequential Recommendation

Bandit multiple hypothesis testing has broad applications in biological sciences, clinical testing for drug discovery, and online A/B/n testing. The framework utilizes an adaptive sampling strategy for multiple testing which aims to maximize statistical power while ensuring anytime false discovery rate control. This paper proposes a robust approach for bandit multiple testing, allowing for (at most) $\varepsilon$ fraction of arbitrary distribution corruption, as in Huber's contamination model. Specifically, we introduce two adaptive sampling strategies designed to minimize the number of samples required to exceed a target true positive rate, while providing anytime control over the false discovery rate. We analyze the sample complexity of our proposed methods and perform numerical simulations to demonstrate their efficiency and robustness. Furthermore, we extend our methods to address scenarios where distributions have infinite variance and situations involving multiple agents collaborating on the same bandit task.

On the Robustness of Bandit Multiple Testing

Hyperparameter Optimization (HPO) is crucial in machine learning, aiming to optimize hyperparameters to enhance model performance. Although existing methods that leverage prior knowledge—drawn from either previous experiments or expert insights—can accelerate optimization, acquiring a correct prior for a specific HPO task is non-trivial. In this work, we propose to relief the reliance on external knowledge by learning a reliable prior \emph{directly} from low-fidelity (LF) problems. We introduce $\texttt{Lamda}$ , an algorithm-agnostic framework designed to boost any baseline HPO algorithm. Specifically, $\texttt{Lamda}$ operates in two phases: (1) it learns a reliable prior by exploring the LF landscape under limited computational budgets, and (2) it leverages this learned prior to guide the HPO process. We showcase how the $\texttt{Lamda}$ framework can be integrated with various HPO algorithms to boost their performance, and further conduct theoretical analysis towards the integrated Bayesian optimization and bandit-based Hyperband. We conduct experiments on $56$ HPO problems spanning diverse domains and model scales. Results show that \texttt{Lamda} consistently enhances its baseline algorithms. Compared to nine state-of-the-art HPO algorithms, our $\texttt{Lamda}$ variant achieves the best performance in $51$ out of $56$ HPO tasks while it is the second best algorithm in the other $5$ cases.

LAMDA: Two-Phase HPO via Learning Prior from Low-Fidelity Data

Fine-tuning large language models (LLMs) in federated settings enables privacy-preserving adaptation but suffers from cross-client interference due to model aggregation. Existing federated LoRA fine-tuning methods, primarily based on FedAvg, struggle with data heterogeneity, leading to harmful cross-client interference and suboptimal personalization. In this work, we propose \textbf{FedALT}, a novel personalized federated LoRA fine-tuning algorithm that fundamentally departs from FedAvg. Instead of using an aggregated model to initialize local training, each client continues training its individual LoRA while incorporating shared knowledge through a separate Rest-of-World (RoW) LoRA component. To effectively balance local adaptation and global information, FedALT introduces an adaptive mixer that dynamically learns input-specific weightings between the individual and RoW LoRA components, drawing conceptual foundations from the Mixture-of-Experts (MoE) paradigm. Through extensive experiments on NLP benchmarks, we demonstrate that FedALT significantly outperforms state-of-the-art personalized federated LoRA fine-tuning methods, achieving superior local adaptation without sacrificing computational efficiency.

FedALT: Federated Fine-Tuning Through Adaptive Local Training with Rest-of-World LoRA

Recent studies have shown that deep learning models are very vulnerable to poisoning attacks. Many defense methods have been proposed to address this issue. However, traditional poisoning attacks are not as threatening as commonly believed. This is because they often cause differences in how the model performs on the training set compared to the validation set. Such inconsistency can alert defenders that their data has been poisoned, allowing them to take the necessary defensive actions. In this paper, we introduce a more threatening type of poisoning attack called the Deferred Poisoning Attack. This new attack allows the model to function normally during the training and validation phases but makes it very sensitive to evasion attacks or even natural noise. We achieve this by ensuring the poisoned model's loss function has a similar value as a normally trained model at each input sample but with a large local curvature. A similar model loss ensures that there is no obvious inconsistency between the training and validation accuracy, demonstrating high stealthiness. On the other hand, the large curvature implies that a small perturbation may cause a significant increase in model loss, leading to substantial performance degradation, which reflects a worse robustness. We fulfill this purpose by making the model have singular Hessian information at the optimal point via our proposed Singularization Regularization term. We have conducted both theoretical and empirical analyses of the proposed method and validated its effectiveness through experiments on image classification tasks. Furthermore, we have confirmed the hazards of this form of poisoning attack under more general scenarios using natural noise, offering a new perspective for research in the field of security.

Deferred Poisoning: Making the Model More Vulnerable via Hessian Singularization

Time series forecasting relies on predicting future values from historical data, yet most state-of-the-art approaches—including transformer and multilayer perceptron-based models—optimize using Mean Squared Error (MSE), which has two fundamental weaknesses: its point-wise error computation fails to capture temporal relationships, and it does not account for inherent noise in the data. To overcome these limitations, we introduce the Residual-Informed Loss (RI-Loss), a novel objective function based on the Hilbert-Schmidt Independence Criterion (HSIC). RI-Loss explicitly models noise structure by enforcing dependence between the residual sequence and a random time series, enabling more robust, noise-aware representations. Theoretically, we derive the first non-asymptotic HSIC bound with explicit double-sample complexity terms, achieving optimal convergence rates through Bernstein-type concentration inequalities and Rademacher complexity analysis. This provides rigorous guarantees for RI-Loss optimization while precisely quantifying kernel space interactions. Empirically, experiments across eight real-world benchmarks and five leading forecasting models demonstrate improvements in predictive performance, validating the effectiveness of our approach. Code will be made publicly available to ensure reproducibility.

RI-Loss: A Learnable Residual-Informed Loss for Time Series Forecasting

The safety alignment of large language models (LLMs) often relies on reinforcement learning from human feedback (RLHF), which requires human annotations to construct preference datasets. Given the challenge of assigning overall quality scores to data, recent works increasingly adopt fine-grained ratings based on multiple safety rules. In this paper, we discover a robust phenomenon: Rules with higher rating entropy tend to have lower accuracy in distinguishing human-preferred responses. Exploiting this insight, we propose ENCORE, a simple entropy-guided method to compose multi-head rewards by penalizing rules with high rating entropy. Theoretically, we show that such rules yield negligible weights under the Bradley–Terry loss during weight optimization, naturally justifying their penalization. Empirically, ENCORE consistently outperforms strong baselines, including random and uniform weighting, single-head Bradley–Terry, and LLM-as-a-judge, etc. on RewardBench safety tasks. Our method is completely training-free, generally applicable across datasets, and retains interpretability, making it a practical and effective approach for multi-attribute reward modeling.

ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models

Implicit Neural Representations (INRs) have revolutionized signal processing and computer vision by modeling signals as continuous, differentiable functions parameterized by neural networks. However, INRs are prone to the spectral bias problem, limiting their ability to retain high-frequency information, and often struggle with noise robustness. Motivated by recent trends in iterative refinement processes, we propose Iterative Implicit Neural Representations (I-INRs), a novel plug-and-play framework that iteratively refines signal reconstructions to restore high-frequency details, improve noise robustness, and enhance generalization, ultimately delivering superior reconstruction quality. I-INRs integrate seamlessly into existing INR architectures with only a 0.5–2\% increase in parameters. During reconstruction, the iterative refinement adds just 0.8–1.6\% additional FLOPs over the baseline while delivering a substantial performance boost of up to +2.0 PSNR. Extensive experiments demonstrate that I-INRs consistently outperform WIRE, SIREN, and Gauss across diverse computer-vision tasks, including image fitting, image denoising, and object-occupancy prediction.

I-INR: Iterative Implicit Neural Representations

Stochastic dynamical systems have emerged as fundamental models across numerous application domains, providing powerful mathematical representations for capturing uncertain system behavior. In this paper, we address the problem of runtime safety and reach-avoid probability prediction for discrete-time stochastic systems with online observations, i.e., estimating the probability that the system satisfies a given safety or reach-avoid specification. Unlike traditional approaches that rely solely on offline models, we propose a framework that incorporates real-time observations to dynamically refine probability estimates for safety and reach-avoid events. By introducing observation-aware barrier functions, our method adaptively updates probability bounds as new observations are collected, combining efficient offline computation with online backward iteration. This approach enables rigorous and responsive prediction of safety and reach-avoid probabilities under uncertainty. In addition to the theoretical guarantees, experimental results on benchmark systems demonstrate the practical effectiveness of the proposed method.

Runtime Safety and Reach-avoid Prediction of Stochastic Systems via Observation-aware Barrier Functions

Large Language Models (LLMs) have achieved impressive performance in complex reasoning problems. Their effectiveness highly depends on the specific nature of the task, especially the required domain knowledge. Existing approaches, such as mixture-of-experts, typically operate at the task level; they are too coarse to effectively solve the heterogeneous problems involving multiple subjects. This work proposes a novel framework that performs fine-grained analysis at subject level equipped with a designated multi-agent collaboration strategy for addressing heterogeneous problem reasoning. Specifically, given an input query, we first employ a Graph Neural Network to identify the relevant subjects and infer their interdependencies to generate an Subject-based Directed Acyclic Graph (S-DAG), where nodes represent subjects and edges encode information flow. Then we profile the LLM models by assigning each model a subject-specific expertise score, and select the top-performing one for matching corresponding subject of the S-DAG. Such subject-model matching enables graph-structured multi-agent collaboration where information flows from the starting model to the ending model over S-DAG. We curate and release multi-subject subsets of standard benchmarks (MMLU-Pro, GPQA, MedMCQA) to better reflect complex, real-world reasoning tasks. Extensive experiments show that our approach significantly outperforms existing task-level model selection and multi-agent collaboration baselines in accuracy and efficiency. These results highlight the effectiveness of subject-aware reasoning and structured collaboration in addressing complex and multi-subject problems.

Downloads

Next from AAAI 2026

BLADE: A Behavior-Level Data Augmentation Framework with Dual Fusion Modeling for Multi-Behavior Sequential Recommendation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

.css-70qvj9{display:-webkit-box;display:-webkit-flex;display:-ms-flexbox;display:flex;-webkit-align-items:center;-webkit-box-align:center;-ms-flex-align:center;align-items:center;}Downloads

Next from AAAI 2026

BLADE: A Behavior-Level Data Augmentation Framework with Dual Fusion Modeling for Multi-Behavior Sequential Recommendation

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES

Downloads