Singapore

Exploration is critical for cooperative multi-agent reinforcement learning (MARL) to improve sample efficiency. However, existing intrinsic motivation-based exploration strategies in MARL overlook the causal relationships among agents, global states, and rewards, suffering from interference by irrelevant factors and resulting in sample inefficiency. 
To address this issue, we propose Causality-aware Efficient Exploration (CEE), a novel framework that enhances sample efficiency by inferring how agents and global states causally relate to rewards, thereby enabling causality-guided exploration. Specifically, CEE operates through two components. First, CEE identifies causal relationships between global states and rewards, filtering out state features with little causal impact on rewards so that only decision-critical state information is kept. Second, CEE discovers causal relationships between agents' behaviors and rewards to quantify each agent's contribution to collective performance. To achieve this, we introduce a causal entropy objective that promotes exploration aligned with decision-critical aspects of the underlying causal structure. We validate CEE through experiments on 21 challenging tasks spanning SMAC, SMAC-v2, and Google Research Football (GRF) environments. Our results demonstrate that CEE achieves superior sample efficiency and asymptotic performance compared to existing MARL methods.

AAAI 2026

Causality-Aware Efficient Exploration for Cooperative Multi-Agent Reinforcement Learning

multi-agent system

causality

reinforcement learning

poster
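
The first CEE component described in the abstract, filtering out state features that have little causal impact on rewards, can be pictured with a minimal sketch. The abstract does not say how causal relevance is estimated or applied, so the per-feature `causal_scores`, the `threshold`, and the function name `mask_state` are all hypothetical and are not taken from the paper.

```python
def mask_state(state, causal_scores, threshold=0.5):
    """Zero out state features whose causal score is below the threshold.

    Assumption: `causal_scores` holds one score in [0, 1] per state feature,
    estimated elsewhere; this sketch only shows the filtering step.
    """
    if len(state) != len(causal_scores):
        raise ValueError("state and causal_scores must have the same length")
    return [x if s >= threshold else 0.0 for x, s in zip(state, causal_scores)]


# Hypothetical global state with four features and their causal scores.
state = [0.9, -1.2, 3.0, 0.4]
scores = [0.8, 0.1, 0.6, 0.3]
masked = mask_state(state, scores)
```

Here only the first and third features pass the threshold, so an exploration bonus computed on `masked` would ignore the other two.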

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.

Time series forecasting is fundamental to diverse applications, with recent approaches leveraging large vision models (LVMs) to capture temporal patterns through visual representations. We reveal that while vision models enhance forecasting performance, 99% of their parameters are unnecessary for time series tasks. Through cross-modal analysis, we find that time series align with low-level textural features but not high-level semantics, which can impair forecasting accuracy. We propose OccamVTS, a knowledge distillation framework that extracts only the essential 1% of predictive information from LVMs into lightweight networks. Using pre-trained LVMs as privileged teachers, OccamVTS employs pyramid-style feature alignment combined with correlation and feature distillation to transfer beneficial patterns while filtering out semantic noise. Counterintuitively, this aggressive parameter reduction improves accuracy by eliminating overfitting to irrelevant visual features while preserving essential temporal patterns. Extensive experiments across multiple benchmark datasets demonstrate that OccamVTS consistently achieves state-of-the-art performance with only 1% of the original parameters, particularly excelling in few-shot and zero-shot scenarios.

OccamVTS: Distilling Vision Models to 1% Parameters for Time Series Forecasting
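
The OccamVTS abstract mentions "correlation and feature distillation" without defining the losses. One common form of correlation distillation, shown here as an assumption rather than the paper's objective, matches the sample-to-sample cosine-similarity matrices of student and teacher features. This works even when the two feature widths differ. All function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def similarity_matrix(feats):
    """Pairwise cosine similarities between the samples in a batch."""
    return [[cosine(u, v) for v in feats] for u in feats]


def correlation_distill_loss(student_feats, teacher_feats):
    """Mean squared difference between student and teacher similarity matrices."""
    s = similarity_matrix(student_feats)
    t = similarity_matrix(teacher_feats)
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Teacher features are 3-D, student features are 2-D; the batch size is 2.
teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student_good = [[2.0, 0.0], [0.0, 3.0]]   # same pairwise structure as the teacher
student_bad = [[1.0, 0.0], [1.0, 0.0]]    # collapses both samples together
loss_good = correlation_distill_loss(student_good, teacher)
loss_bad = correlation_distill_loss(student_bad, teacher)
```

The loss is near zero when the student preserves the teacher's pairwise structure and grows when distinct samples collapse onto the same direction.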

Continual instruction tuning aims to incrementally adapt large language models to new tasks without forgetting previously acquired knowledge. Existing approaches often struggle to balance plasticity and stability. Replay-based methods retrain on historical data, which raises privacy concerns. Architecture-based methods allocate task-specific components, resulting in significant parameter growth. To address this, we consider a structure-sharing strategy that enables parameter reuse across similar tasks and expands only when necessary, avoiding any data replay. Specifically, we introduce Grow-on-Demand (GoD-MoE), a parameter-efficient framework that is based on sparse and adaptive expert module expansion for continual instruction tuning. GoD-MoE inserts multiple LoRA-based experts into attention layers and dynamically activates a small subset of experts for each task. To avoid redundant parameter growth, we develop an Expert Demand Detector that determines whether new experts are added, facilitating adaptive structural sharing and minimizing parameter overhead. We conduct comprehensive experiments on the TRACE benchmark, demonstrating that GoD-MoE achieves state-of-the-art performance. Furthermore, it effectively mitigates catastrophic forgetting and even outperforms several advanced replay-based baselines.

Grow-on-Demand: Sparse and Adaptive Expert Expansion for Continual Instruction Tuning
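
The sparse activation of LoRA-based experts described in the GoD-MoE abstract can be sketched as top-k gating over low-rank updates. This is an illustrative toy, not the paper's implementation: the router scores are given rather than learned, the Expert Demand Detector is not modeled, and every name below is hypothetical.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_delta(expert, x):
    """Low-rank update B @ (A @ x), with A of shape (r, d) and B of shape (d, r)."""
    a, b = expert
    return matvec(b, matvec(a, x))


def top_k_gate(scores, k):
    """Keep the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - m) for i in top}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}


def moe_lora_forward(base_out, x, experts, scores, k=2):
    """Add the gated low-rank updates of the selected experts to the base output."""
    out = list(base_out)
    for i, g in top_k_gate(scores, k).items():
        delta = lora_delta(experts[i], x)
        out = [o + g * d for o, d in zip(out, delta)]
    return out


# Three rank-1 experts on a 2-D input; the router favors experts 0 and 1.
experts = [
    ([[1.0, 0.0]], [[1.0], [0.0]]),
    ([[0.0, 1.0]], [[0.0], [1.0]]),
    ([[1.0, 1.0]], [[1.0], [1.0]]),
]
x = [1.0, 2.0]
y = moe_lora_forward([0.0, 0.0], x, experts, scores=[5.0, 5.0, -5.0], k=2)
```

Only the selected experts are evaluated, so per-input compute depends on `k` rather than on the total number of experts.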

Data-protection regulations such as the GDPR grant every participant in a federated system a right to be forgotten. Federated unlearning has therefore emerged as a research frontier, aiming to remove a specific party's contribution from the learned model while preserving the utility of the remaining parties. However, most unlearning techniques focus on Horizontal Federated Learning (HFL), where data are partitioned by samples. In contrast, Vertical Federated Learning (VFL) allows organizations that possess complementary feature spaces to train a joint model without sharing raw data. The resulting feature-partitioned architecture renders HFL-oriented unlearning methods ineffective. In this paper, we propose ReMisVFU, a plug-and-play representation-misdirection framework that enables fast, client-level unlearning in splitVFL systems. When a deletion request arrives, the forgetting party collapses its encoder output to a randomly sampled anchor on the unit sphere, severing the statistical link between its features and the global model. To maintain utility for the remaining parties, the server jointly optimizes a retention loss and a forgetting loss, aligning their gradients via orthogonal projection to eliminate destructive interference. Evaluations on public benchmarks show that ReMisVFU suppresses back-door attack success to the natural class-prior level and sacrifices only about 2.5 percentage points of clean accuracy, outperforming state-of-the-art baselines.

REMISVFU: Vertical Federated Unlearning via Representation Misdirection for Intermediate Output Feature
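
Two steps in the ReMisVFU abstract have simple geometric readings: sampling a random anchor on the unit sphere, and removing gradient conflict by orthogonal projection. The sketch below normalizes a Gaussian vector for the anchor and uses a PCGrad-style rule for the projection. The abstract does not give the exact projection rule, so that choice and all names here are assumptions.

```python
import math
import random


def random_unit_anchor(dim, rng):
    """Sample a point uniformly on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_conflict(g_forget, g_retain):
    """If the gradients conflict (negative dot product), remove from g_forget
    its component along g_retain; otherwise return g_forget unchanged."""
    d = dot(g_forget, g_retain)
    if d >= 0:
        return list(g_forget)
    scale = d / dot(g_retain, g_retain)
    return [gf - scale * gr for gf, gr in zip(g_forget, g_retain)]


anchor = random_unit_anchor(8, random.Random(0))
g_projected = project_conflict([1.0, -1.0], [0.0, 1.0])
```

After projection the forgetting gradient is orthogonal to the retention gradient, so a step along it does not, to first order, increase the retention loss.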

In this paper, we aim to create physical digital twins of deformable objects under interaction. Existing methods focus on physical learning for current state modeling but generalize poorly to future prediction. This is because they ignore the intrinsic physical properties of deformable objects, which limits physical learning even in current state modeling.
To address this, we present NeuSpring, a neural spring field for the reconstruction and simulation of deformable objects from videos. Built upon spring-mass models for realistic physical simulation, our method consists of two major innovations: 1) a piecewise topology solution that efficiently models multi-region spring connection topologies using zero-order optimization, which accounts for the material heterogeneity of real-world objects; 2) a neural spring field that represents spring physical properties across different frames using a canonical coordinate-based neural network, which effectively leverages the spatial associativity of springs for physical learning. Experiments on real-world datasets demonstrate that NeuSpring achieves superior reconstruction and simulation performance for current state modeling and future prediction, with Chamfer distance improved by **20%** and **25%**, respectively.

NeuSpring: Neural Spring Fields for Reconstruction and Simulation of Deformable Objects from Videos
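
The spring-mass model that NeuSpring builds on can be illustrated with a single simulation step. This sketch uses Hooke's law and semi-implicit Euler integration with constant, hand-set stiffness values. In the paper the spring properties come from a neural field, and gravity, damping, and collisions are omitted here, so this is a generic textbook step rather than the authors' simulator.

```python
import math


def spring_force(p_i, p_j, stiffness, rest_length):
    """Hooke's-law force on point i from the spring connecting points i and j."""
    d = [b - a for a, b in zip(p_i, p_j)]
    length = math.sqrt(sum(c * c for c in d))
    if length == 0.0:
        return [0.0] * len(d)
    magnitude = stiffness * (length - rest_length)
    return [magnitude * c / length for c in d]


def step(positions, velocities, springs, mass, dt):
    """One semi-implicit Euler step; springs are (i, j, stiffness, rest_length)."""
    forces = [[0.0, 0.0, 0.0] for _ in positions]
    for i, j, k, rest in springs:
        f = spring_force(positions[i], positions[j], k, rest)
        for axis in range(3):
            forces[i][axis] += f[axis]
            forces[j][axis] -= f[axis]
    new_v = [[v[a] + dt * f[a] / mass for a in range(3)]
             for v, f in zip(velocities, forces)]
    new_p = [[p[a] + dt * nv[a] for a in range(3)]
             for p, nv in zip(positions, new_v)]
    return new_p, new_v


# Two unit masses joined by a stretched spring (length 2, rest length 1).
pos = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos1, vel1 = step(pos, vel, springs=[(0, 1, 10.0, 1.0)], mass=1.0, dt=0.1)
```

The stretched spring pulls the two points toward each other, so after one step they move inward symmetrically.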

Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing images based on free-form natural language expressions. Existing approaches are typically constrained to closed-set vocabularies, limiting their applicability in open-world scenarios. While recent attempts leverage generic foundation models for open-vocabulary RSVG, they rely heavily on expensive high-quality datasets and time-consuming fine-tuning. To address these limitations, we propose RSVG-ZeroOV, a training-free framework that explores the potential of frozen generic foundation models for zero-shot open-vocabulary RSVG. Specifically, RSVG-ZeroOV comprises three key stages: (i) Overview: We utilize a vision-language model (VLM) to obtain cross-attention maps that capture semantic correlations between text queries and visual regions. (ii) Focus: By leveraging the fine-grained modeling priors of a diffusion model (DM), we fill in gaps in structural and shape information of objects, which are often overlooked by the VLM. (iii) Evolve: A simple yet effective attention evolution module is introduced to suppress irrelevant activations, yielding purified segmentation masks over the referred objects. Without cumbersome task-specific training, RSVG-ZeroOV offers an efficient and scalable solution. Extensive experiments demonstrate that the proposed framework consistently outperforms existing weakly-supervised and zero-shot methods.

RSVG-ZeroOV: Exploring a Training-Free Framework for Zero-Shot Open-Vocabulary Visual Grounding in Remote Sensing Images

Surface reconstruction has been widely studied in computer vision and graphics. However, existing surface reconstruction works struggle to recover accurate scene geometry when the input views are extremely sparse. To address this issue, we propose MeshSplat, a generalizable sparse-view surface reconstruction framework via Gaussian Splatting. Our key idea is to leverage 2DGS as a bridge, which connects novel view synthesis to learned geometric priors and then transfers these priors to achieve surface reconstruction. Specifically, we incorporate a feed-forward network to predict per-view pixel-aligned 2DGS, which enables the network to synthesize novel view images and thus eliminates the need for direct 3D ground-truth supervision. To improve the accuracy of 2DGS position and orientation prediction, we propose a Weighted Chamfer Distance Loss to regularize the depth maps, especially in overlapping areas of input views, and also a normal prediction network to align the orientation of 2DGS with normal vectors predicted by a monocular normal estimator. Extensive experiments validate the effectiveness of our proposed improvement, demonstrating that our method achieves state-of-the-art performance in generalizable sparse-view mesh reconstruction tasks.

MeshSplat: Generalizable Sparse-View Surface Reconstruction via Gaussian Splatting
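
The Weighted Chamfer Distance Loss named in the MeshSplat abstract builds on the standard Chamfer distance between two point sets. The sketch below adds optional per-point weights to the usual symmetric nearest-neighbor form. How the paper derives its weights (for example from view overlap) is not stated in the abstract, so the weights here are hypothetical inputs.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def weighted_chamfer(pts_a, pts_b, w_a=None, w_b=None):
    """Symmetric Chamfer distance with optional per-point weights.

    Assumption: each weight list is non-negative with a positive sum.
    """
    if w_a is None:
        w_a = [1.0] * len(pts_a)
    if w_b is None:
        w_b = [1.0] * len(pts_b)
    a_to_b = sum(w * min(sq_dist(p, q) for q in pts_b)
                 for p, w in zip(pts_a, w_a)) / sum(w_a)
    b_to_a = sum(w * min(sq_dist(q, p) for p in pts_a)
                 for q, w in zip(pts_b, w_b)) / sum(w_b)
    return a_to_b + b_to_a


# Two small point sets, e.g. points unprojected from two depth maps.
cloud_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
cloud_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
unweighted = weighted_chamfer(cloud_a, cloud_b)
overlap_only = weighted_chamfer(cloud_a, cloud_b, w_a=[1.0, 0.0], w_b=[1.0, 0.0])
```

With uniform weights both mismatched points contribute. Putting all the weight on the shared point drives the loss to zero, which is how a weighting scheme can focus the penalty on regions where the two sets are expected to agree.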

Gait recognition offers a non-intrusive biometric solution by identifying individuals through their walking patterns. Although discriminative models have achieved notable success in this domain, the full potential of generative models remains largely unexplored. In this paper, we introduce CoD², a novel framework that combines the data distribution modeling capabilities of diffusion models with the semantic representation learning strengths of discriminative models to extract robust gait features. We propose a Multi-level Conditional Control strategy that integrates both high-level identity-aware semantic conditions and low-level visual details. Specifically, the high-level condition, extracted by the discriminative extractor, guides the generation of identity-consistent gait sequences, while low-level visual details, such as appearance and motion, are preserved to enhance consistency. Moreover, the generated sequences facilitate the discriminative extractor's learning, enabling it to capture more comprehensive high-level semantic features. Extensive experiments on four datasets (SUSTech1K, CCPG, GREW, and Gait3D) demonstrate that CoD² achieves state-of-the-art performance and can be seamlessly integrated with existing discriminative methods, yielding consistent improvements.

Gait Recognition via Collaborating Discriminative and Generative Diffusion Models

Spatiotemporal forecasting often relies on computationally intensive models to capture complex dynamics. Knowledge distillation (KD) has emerged as a key technique for creating lightweight student models, with recent advances like frequency-aware KD successfully preserving spectral properties (i.e., high-frequency details and low-frequency trends). However, these methods are fundamentally constrained by operating on pixel-level signals, leaving them blind to the rich semantic and causal context behind the visual patterns. To overcome this limitation, we introduce S$^2$-KD, a novel framework that unifies Semantic priors with Spectral representations for distillation. Our approach begins by training a privileged, multimodal teacher model. This teacher leverages textual narratives from a Large Multimodal Model (LMM) to reason about the underlying causes of events, while its architecture simultaneously decouples spectral components in its latent space. The core of our framework is a new distillation objective that transfers this unified semantic-spectral knowledge into a lightweight, vision-only student. Consequently, the student learns to make predictions that are not only spectrally accurate but also semantically coherent, without requiring any textual input or architectural overhead at inference. Extensive experiments on benchmarks like WeatherBench and TaxiBJ+ show that S$^2$-KD significantly boosts the performance of simple student models, enabling them to outperform state-of-the-art methods, particularly in long-horizon and complex non-stationary scenarios.

S^2-KD: Semantic-Spectral Knowledge Distillation Spatiotemporal Forecasting

Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training—utilizing only one template and one search image per sequence—which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and causes the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).

Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

Data augmentation is an intuitive way to increase the diversity of training instances in machine learning. Mixup is acknowledged as an effective and efficient mix-based data augmentation method, following a linear alignment assumption that linear interpolations of features align with the corresponding linear interpolations of labels. Unfortunately, this assumption can be violated in many complex scenarios, resulting in augmented instances with noisy labels, especially for regression problems. To solve this problem, we propose an easy-to-implement mixup method, namely DEnoising MIXUP (DE-mixup), which iteratively corrects the noisy response targets by leveraging an auxiliary noise estimation task with mixup deep features. Additionally, we suggest an efficient optimization method based on the alternating direction method of multipliers. We compare DE-mixup with existing mixup variants and other prevalent data augmentation methods across benchmark regression datasets. Empirical results indicate the effectiveness of DE-mixup in both in-distribution and out-of-distribution cases.
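
The linear alignment assumption behind mixup, and how it can produce noisy regression targets, can be seen in a small worked example. The sketch below implements standard mixup (not DE-mixup) and applies it to a toy target y = x², chosen here purely for illustration.

```python
import random


def sample_lambda(alpha, rng):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return rng.betavariate(alpha, alpha)


def mixup_pair(x_i, y_i, x_j, y_j, lam):
    """Linearly interpolate both the features and the regression targets."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = lam * y_i + (1.0 - lam) * y_j
    return x_mix, y_mix


def true_target(x):
    """Toy nonlinear ground truth used only for this illustration: y = x^2."""
    return x[0] ** 2


# Mix the samples x = 0 (y = 0) and x = 2 (y = 4) with equal weight.
x_mix, y_mix = mixup_pair([0.0], 0.0, [2.0], 4.0, lam=0.5)
label_noise = y_mix - true_target(x_mix)

# In training, lam would be sampled per pair rather than fixed.
lam = sample_lambda(0.4, random.Random(0))
```

The mixed input is x = 1, whose true target is 1, but mixup assigns the label 2. That gap is the kind of label noise a denoising variant aims to correct.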
