Singapore

Large language models (LLMs) have shown impressive capabilities in natural language tasks, yet they continue to struggle with multi-step mathematical reasoning, where correctness depends on a precise chain of intermediate steps. Preference optimization methods such as Direct Preference Optimization (DPO) have improved answer-level alignment, but they often overlook the reasoning process itself, providing little supervision over intermediate steps that are critical for complex problem-solving. Existing fine-grained approaches typically rely on strong annotators or reward models to assess the quality of individual steps. However, reward models are vulnerable to reward hacking. To address this, we propose ISLA, a reward-model-free framework that constructs step-level preference data directly from SFT gold traces. ISLA also introduces a self-improving pruning mechanism that identifies informative steps based on two signals: their marginal contribution to final accuracy (relative accuracy) and the model's uncertainty, inspired by the concept of information gain. Empirically, ISLA achieves better performance than DPO while using only 12% of the training tokens, demonstrating that careful step-level selection can significantly improve both reasoning accuracy and training efficiency.

AAAI 2026

Beyond Step Pruning: Information Theory Based Step-level Optimization for Self-Refining Large Language Models

step pruning

math reasoning

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.

Graph Neural Networks (GNNs) have been studied from two primary perspectives: spectral, which employs global graph signal filtering and is theoretically more expressive, and spatial, which builds on local neighborhood aggregation and generalizes well across diverse graph structures. While spectral GNNs are expected to perform better in theory, they often underperform in practice compared to spatial models.
To better understand this gap, we introduce a novel theoretical framework for converting spectral GNNs into the spatial domain, allowing for more intuitive analysis. This transformation reveals that signal looping and repeated high-order aggregation are major causes of over-smoothing in spatial GNNs. By addressing these issues in the spatial domain and converting the model back to the spectral domain, we propose DeloopSGNN, a spectral GNN with improved expressive capacity.
Experiments on benchmark datasets show that DeloopSGNN achieves consistently strong performance in terms of accuracy and adversarial robustness, demonstrating that spectral GNNs can benefit significantly from careful architectural design grounded in our proposed framework.

DeloopSGNN: Revisiting Spectral GNNs Through the Lens of Spatial Aggregation

Time-series (TS) data exhibit pronounced non-stationarity. Consequently, most forecasting methods remain vulnerable to concept drift, despite the prevalent use of instance normalization. We tackle this challenge by first analysing concept drift through a bias-variance lens and proving that a weighted ensemble reduces variance without increasing bias. These insights motivate DeepBooTS, a novel end-to-end dual-stream residual-decreasing boosting method that progressively reconstructs the intrinsic signal. In our design, each block of a deep model becomes an ensemble of learners with an auxiliary output branch forming a highway to the final prediction. The block-wise outputs correct the residuals of previous blocks, leading to a learning-driven decomposition of both inputs and targets. This method enhances versatility and interpretability while substantially improving robustness to concept drift. Extensive experiments, including those on large-scale datasets, show that the proposed method outperforms existing methods by a large margin, yielding an average performance improvement of 15.8% across various datasets and establishing a new benchmark for TS forecasting.

DeepBooTS: Dual-Stream Residual Boosting for Drift-Resilient Time-Series Forecasting

Few-shot Learning (FSL), which aims to recognize new classes from only a few labeled examples, has recently become a popular task and has been widely applied in fields such as natural science, remote sensing, and medical imaging.
However, most existing methods focus only on the visual modality and compute prototypes directly from raw support images, which lack comprehensive and rich multimodal information.
To address these limitations, we propose a novel Multimodal Prototype Augmentation FSL framework called MPA, including LLM-based Multi-Variant Semantic Enhancement (LMSE), Hierarchical Multi-View Augmentation (HMA), and an Adaptive Uncertain Class Absorber (AUCA). 
LMSE leverages large language models to generate diverse paraphrased category descriptions, enriching the support set with additional semantic cues. 
HMA exploits both natural and multi-view augmentations to enhance feature diversity (e.g., changes in viewing distance, camera angles, and lighting conditions). 
AUCA models uncertainty by introducing uncertain classes via interpolation and Gaussian sampling, effectively absorbing uncertain samples. 
Extensive experiments on four single-domain and six cross-domain FSL benchmarks demonstrate that MPA consistently outperforms existing state-of-the-art methods by a large margin. Notably, in the 5-way 1-shot setting, MPA surpasses the second-best method by 12.29% and 24.56% on the single-domain and cross-domain benchmarks, respectively. All source code will be publicly available.

MPA: Multimodal Prototype Augmentation for Few-Shot Learning

Pretrained vision-language models exhibit strong zero-shot classification capabilities, but their predictions degrade significantly under common image corruptions. To improve robustness, many test-time adaptation (TTA) methods adopt positive data augmentation (PDA), which generates multiple views of each test sample to reduce prediction variance. However, these methods suffer from two key limitations. First, they introduce considerable computational overhead due to the large number of augmentations required per image. Second, they fail to mitigate prediction bias, where the model tends to predict certain classes disproportionately under corruption, because PDA operates on corrupted inputs and typically does not remove the corruption itself. To address these challenges, we propose Panda, a novel TTA method based on negative data augmentation (NDA). Unlike positive augmentations that preserve object semantics, Panda generates negative augmentations by disrupting semantic content. It divides images into patches and randomly assembles them from a shared patch pool. These negatively augmented images retain corruption-specific features while discarding object-relevant signals. We then subtract the mean feature of these negative samples from the original image feature, effectively suppressing corruption-related components while preserving class-relevant information. This mitigates prediction bias under distribution shifts. Importantly, Panda allows augmentations to be shared across samples within a batch, resulting in minimal computational overhead. Panda can be seamlessly integrated into existing test-time adaptation frameworks and substantially improve their robustness. We demonstrate the effectiveness and efficiency of Panda on standard corruption benchmarks. Our experiments indicate that Panda delivers superior performance compared to PDA methods, and that a wide range of TTA methods perform significantly better when integrated with Panda.
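The negative-augmentation step described for Panda can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the grid-of-floats image format, and `toy_encoder` (a stand-in for a real vision encoder) are all assumptions made for illustration, and image sides are assumed to be divisible by the patch size.

```python
import random


def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch blocks."""
    h, w = len(image), len(image[0])
    return [
        [row[c:c + patch] for row in image[r:r + patch]]
        for r in range(0, h, patch)
        for c in range(0, w, patch)
    ]


def assemble(patches, h, w, patch):
    """Tile patches (row-major order) back into an h x w grid."""
    per_row = w // patch
    image = [[0.0] * w for _ in range(h)]
    for i, block in enumerate(patches):
        r0, c0 = (i // per_row) * patch, (i % per_row) * patch
        for dr, block_row in enumerate(block):
            image[r0 + dr][c0:c0 + patch] = block_row
    return image


def negative_views(batch, patch, n_views, rng):
    """Build negative images from a patch pool shared by the whole batch."""
    h, w = len(batch[0]), len(batch[0][0])
    pool = [p for img in batch for p in split_into_patches(img, patch)]
    n_patches = (h // patch) * (w // patch)
    return [
        assemble([rng.choice(pool) for _ in range(n_patches)], h, w, patch)
        for _ in range(n_views)
    ]


def debias(batch, encoder, patch=2, n_views=8, seed=0):
    """Subtract the mean negative-view feature from each image feature."""
    rng = random.Random(seed)
    negatives = [encoder(v) for v in negative_views(batch, patch, n_views, rng)]
    dim = len(negatives[0])
    mean_neg = [sum(f[d] for f in negatives) / len(negatives) for d in range(dim)]
    return [[x - m for x, m in zip(encoder(img), mean_neg)] for img in batch]


def toy_encoder(image):
    """Hypothetical encoder: (mean pixel, max pixel). Stands in for a real model."""
    pixels = [x for row in image for x in row]
    return [sum(pixels) / len(pixels), max(pixels)]
```

Because the negative views are built once per batch from the shared pool, the encoder runs `n_views` extra times per batch rather than per image, which is the source of the low overhead the abstract mentions.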

Panda: Test-Time Adaptation with Negative Data Augmentation

How can vision-language-action (VLA) models adapt to new environments where world dynamics shift?
While recent research has combined world modeling and action prediction to improve VLA performance, existing methods largely rely on pretraining on static datasets, without mechanisms for active adaptation to new environments. As a result, these models often fail to generalize when deployed in unseen scenarios with novel object configurations or dynamics.

We present WorldAgen, a unified framework that jointly learns world modeling and action prediction while enabling test-time training (TTT) to adapt to new environments. WorldAgen employs a shared Transformer backbone with two heads: (1) a world-model head that predicts future states from past state-action trajectories, and (2) an agent-model head that predicts actions conditioned on task instructions. At test time, WorldAgen samples exploratory actions, collects ground-truth state transitions, and performs lightweight TTT updates to refine its world model. This adaptation improves the model's understanding of the environment and leads to more accurate action predictions.

Experiments on the CALVIN and LIBERO benchmarks demonstrate that our baseline model achieves comparable, and in some cases superior, performance to current state-of-the-art approaches. Moreover, with TTT on a small number of samples, our method surpasses existing state-of-the-art models, highlighting the effectiveness of adapting world models at inference time.

WorldAgen: Unified State-Action Prediction with Test-Time World Model Training

Multi-modal object Re-IDentification (ReID) is devoted to retrieving specific objects through the exploitation of complementary multi-modal image information. Existing methods for multi-modal object ReID mainly concentrate on the fusion of multi-modal features, yet neglect background interference. Besides, current multi-modal fusion methods often focus on aligning modality pairs but struggle with multi-modal consistency alignment. To address these issues, we propose a novel selective interaction and global-local alignment framework called Signal for multi-modal object ReID. Specifically, we first propose a Selective Interaction Module (SIM) to select important patch tokens with intra-modal and inter-modal information. These important patch tokens engage in the interaction with class tokens, thereby yielding more discriminative features. Then, we propose a Global Alignment Module (GAM) to simultaneously align multi-modal features by minimizing the volume of 3D polyhedra in the Gramian space.
Meanwhile, we propose a Local Alignment Module (LAM) to align local features in a shift-aware manner. With these modules, our proposed framework can extract more discriminative features for object ReID. Extensive experiments on three multi-modal object ReID benchmarks (i.e., RGBNT201, RGBNT100, MSVR310) validate the effectiveness of our method. The source code is available at .
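To give some intuition for a volume-based alignment objective, the sketch below computes the volume of the parallelotope spanned by three unit-normalized feature vectors as the square root of their Gram determinant. This is a standard linear-algebra identity, used here as an assumed stand-in for the quantity GAM minimizes; the abstract does not give the exact formulation, and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def gram_matrix(vectors):
    """Pairwise inner products G[i][j] = <v_i, v_j>."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def gram_volume(feat_a, feat_b, feat_c):
    """Volume spanned by three unit-normalized feature vectors (0 when aligned)."""
    vectors = [normalize(v) for v in (feat_a, feat_b, feat_c)]
    det = det3(gram_matrix(vectors))
    return math.sqrt(max(det, 0.0))  # clamp tiny negative values from rounding
```

Under this reading, mutually orthogonal modality features give the largest volume and perfectly aligned features give zero, so driving the volume down pulls all three modalities together at once instead of pair by pair.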

Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification

Long-term action anticipation from egocentric video is critical for applications such as human-computer interaction and assistive technologies, where anticipating user intent enables proactive and context-aware AI assistance. However, existing approaches suffer from three key limitations: 1) underutilization of fine-grained visual cues from hand-object interactions, 2) neglect of semantic dependencies between verbs and nouns, and 3) lack of explicit cognitive reasoning, limiting generalization and long-term forecasting ability. To overcome these challenges, we propose INSIGHT, a unified two-stage framework for egocentric action anticipation. In the first stage, INSIGHT focuses on extracting semantically rich features from hand-object interaction regions and enhances action representations using a verb-noun co-occurrence matrix. In the second stage, it introduces a reinforcement learning-based module that simulates explicit cognitive reasoning through a structured process: visual perception (think) → intention inference (reason) → action anticipation (answer). Extensive experiments on Ego4D, EPIC-Kitchens-55, and EGTEA Gaze+ benchmarks show that INSIGHT achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.
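A verb-noun co-occurrence matrix of the kind mentioned above can be built from annotated action pairs with a few lines of counting. The sketch below is illustrative only: the function names and the toy pairs are assumptions, and the abstract does not say how INSIGHT feeds these statistics into its action representations.

```python
from collections import Counter, defaultdict


def cooccurrence(pairs):
    """Count how often each noun appears with each verb."""
    counts = defaultdict(Counter)
    for verb, noun in pairs:
        counts[verb][noun] += 1
    return counts


def noun_given_verb(counts):
    """Row-normalize the counts into an estimate of P(noun | verb)."""
    probs = {}
    for verb, nouns in counts.items():
        total = sum(nouns.values())
        probs[verb] = {noun: c / total for noun, c in nouns.items()}
    return probs
```

For example, given the pairs `("cut", "onion")`, `("cut", "onion")`, `("cut", "bread")`, the row for `cut` puts two thirds of its mass on `onion`, which encodes the kind of semantic dependency between verbs and nouns that the abstract says earlier methods neglect.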

Intention-Guided Cognitive Reasoning for Egocentric Long-Term Action Anticipation

Latent World Models enhance scene representation through temporal self-supervised learning, presenting a perception annotation-free paradigm for end-to-end autonomous driving. However, the reconstruction-oriented representation learning tangles perception with planning tasks, leading to suboptimal optimization for planning. To address this challenge, we propose WorldRFT, a planning-oriented latent world model framework that aligns scene representation learning with planning via a hierarchical planning decomposition and local-aware interactive refinement mechanism, augmented by reinforcement learning fine-tuning (RFT) to enhance safety-critical policy performance. Specifically, WorldRFT integrates a vision-geometry foundation model to improve 3D spatial awareness, employs hierarchical planning task decomposition to guide representation optimization, and utilizes local-aware iterative refinement to derive a planning-oriented driving policy. Furthermore, we introduce Group Relative Policy Optimization (GRPO), which applies trajectory Gaussianization and collision-aware rewards to fine-tune the driving policy, yielding systematic improvements in safety. WorldRFT achieves state-of-the-art (SOTA) performance on both open-loop nuScenes and closed-loop NavSim benchmarks. On nuScenes, it reduces collision rates by 83% (0.30% → 0.05%). On NavSim, using camera-only sensor input, it performs competitively with the LiDAR-based SOTA method DiffusionDrive (87.8 vs. 88.1 PDMS).

WorldRFT: Latent World Model Planning with Reinforcement Fine-Tuning for Autonomous Driving

Out-of-distribution (OOD) detection is committed to delineating the classification boundaries between in-distribution (ID) and OOD images. Recent advances in vision-language models (VLMs) have demonstrated remarkable OOD detection performance by integrating both visual and textual modalities. In this context, negative prompts are introduced to emphasize the dissimilarity between image features and prompt content. However, these prompts often include a broad range of non-ID features, which may result in suboptimal outcomes due to the capture of overlapping or misleading information. To address this issue, we propose Positive and Negative Prompt Supervision, which encourages negative prompts to capture inter-class features and transfers this semantic knowledge to the visual modality to enhance OOD detection performance. Our method begins with class-specific positive and negative prompts initialized by large language models (LLMs). These prompts are subsequently optimized, with positive prompts focusing on features within each class, while negative prompts highlight features around category boundaries. Additionally, a graph-based architecture is employed to aggregate semantic-aware supervision from the optimized prompt representations and propagate it to the visual branch, thereby enhancing the performance of the energy-based OOD detector. Extensive experiments on two benchmarks, CIFAR-100 and ImageNet-1K, across eight OOD datasets and five different LLMs, demonstrate that our method outperforms state-of-the-art baselines.
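For readers unfamiliar with energy-based OOD detectors, the commonly used energy score is the negative, temperature-scaled log-sum-exp of the classifier logits. The sketch below shows that standard score only; the function name and the example logits are illustrative, and the abstract does not specify how the proposed method computes or thresholds it.

```python
import math


def energy_score(logits, temperature=1.0):
    """Energy E(x) = -T * log(sum_i exp(logit_i / T)), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in scaled))
    return -temperature * log_sum_exp
```

A confident in-distribution prediction (one large logit) yields a lower energy than a flat, low-magnitude logit vector, so a detector of this kind flags inputs whose energy exceeds a chosen threshold as OOD.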

Out-of-Distribution Detection with Positive and Negative Prompt Supervision Using Large Language Models

Bayesian networks play a crucial role in various domains for unsupervised feature extraction and data interpretation. Poisson gamma belief networks (PGBNs), a type of Bayesian network, have shown promise in analyzing high-dimensional count data. However, PGBNs encounter significant challenges when applied to sparse data, particularly in achieving accurate feature extraction and avoiding overfitting during missing value prediction. In this paper, we propose sparse Poisson gamma belief networks (SPGBNs), a Bayesian network model designed to address these limitations. By incorporating sparse graph-structured priors over the weight matrices between adjacent layers, the proposed SPGBNs effectively capture the inherent sparsity and graph structures of latent features. As a result, SPGBNs demonstrate superior generalization on missing data prediction and enable more stable extraction of meaningful latent features compared to existing approaches. Additionally, we develop an efficient Gibbs sampling algorithm that significantly improves the training stability and computational efficiency of SPGBNs. Extensive experiments on real-world datasets validate the effectiveness of our approach.
