Singapore

Large language models have revolutionized agent planning by serving as the central reasoning engine. However, LLM-based agents often struggle to generalize across complex environments and to adapt to stochastic feedback arising from environment–action interactions. We propose Counterfactual Planning—a method designed to improve the generalizability and adaptability of agents&#39; actions by inferring the causal representation of environmental confounders and performing counterfactual reasoning over planned actions. We formalize the agent planning process as a structural causal model, providing a mathematical formulation for causal analysis of how environmental states influence action generation and how actions affect future state transitions. To support generalizable action planning, we introduce the State Causality Evaluator (SCE), which dynamically infers task-conditioned causal representations from complex environment states; and to enhance adaptability under stochastic feedback, we propose the What-If-Not (WIN) reward, which performs counterfactual interventions to finish action refinement through causal evaluation. We validate our framework in an open-world environment, where experiments demonstrate improvements in both action generalization and planning adaptability.

AAAI 2026

Counterfactual Planning for Generalizable Agents’ Actions

causal inference，agentic ai

agent planning

Large language models have revolutionized agent planning by serving as the central reasoning engine. However, LLM-based agents often struggle to generalize across complex environments and to adapt to stochastic feedback arising from environment–action interactions. We propose Counterfactual Planning—a method designed to improve the generalizability and adaptability of agents' actions by inferring the causal representation of environmental confounders and performing counterfactual reasoning over planned actions. We formalize the agent planning process as a structural causal model, providing a mathematical formulation for causal analysis of how environmental states influence action generation and how actions affect future state transitions. To support generalizable action planning, we introduce the State Causality Evaluator (SCE), which dynamically infers task-conditioned causal representations from complex environment states; and to enhance adaptability under stochastic feedback, we propose the What-If-Not (WIN) reward, which performs counterfactual interventions to finish action refinement through causal evaluation. We validate our framework in an open-world environment, where experiments demonstrate improvements in both action generalization and planning adaptability.

technical paper

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.<br><br>

To access this event page, you need to log in with the **email address you registered with**. <br>Access credentials will be sent to your email from Underline -  subject line "Welcome to AAAI 2026". Please be sure to check your spam email folder if you do not see an email confirmation right away.

Please log in

To access this event page, you are required to register.
Please complete your registration to continue.

We recommend reading [**the registration information**](https://aaai.org/conference/aaai/aaai-26/registration/) first.

**Online Registration Form**: https://aaai.getregistered.net/conference-2026 

Registration Required

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

Dynamic sequences with varying lengths have been widely used in the training of Transformer-based large language models (LLMs).
However, current training frameworks adopt a pre-defined static parallel strategy for these sequences, causing neither communication-parallelization cancellation on short sequences nor out-of-memory on long sequences.
To mitigate these issues, we propose ParaDySe, a novel adaptive Parallel strategy switching framework for Dynamic Sequences.
ParaDySe enables on-the-fly optimal strategy adoption according to the immediate input sequence.
It first implements the modular function libraries for parallel strategies with unified tensor layout specifications,
and then builds sequence-aware memory and time cost models with hybrid methods. 
Guided by cost models, ParaDySe selects optimal layer-wise strategies for dynamic sequences via an efficient heuristic algorithm.
By integrating these techniques together, ParaDySe achieves seamless hot-switching of optimal strategies through its well-designed function libraries.
We compare ParaDySe with baselines on representative LLMs under datasets with sequence lengths up to 624K. 
Experimental results indicate that ParaDySe addresses OOM and CPC bottlenecks in LLM training by systematically integrating long-sequence optimizations with existing frameworks.

ParaDySe: A Parallel Strategy Switching Framework for Dynamic Sequences in Transformer-based Large Language Models

Compared to single-target adversarial attacks, multi-target attacks have garnered significant attention due to their ability to generate adversarial images for multiple target classes simultaneously. However, existing generative approaches for multi-target attacks primarily encode target labels into one-dimensional tensors, leading to a loss of fine-grained visual information and overfitting to model-specific features during noise generation. To address this gap, we first identify and validate that the semantic feature quality and quantity are critical factors affecting the transferability of targeted attacks: 1) Feature quality refers to the structural and detailed completeness of the implanted target features, as deficiencies may result in the loss of key discriminative information; 2) Feature quantity refers to the spatial sufficiency of the implanted target features, as inadequacy limits the victim model's attention to this feature. Based on these findings, we propose the 2D Tensor-Guided Adversarial Fusion (TGAF) framework, which leverages the powerful generative capabilities of diffusion models to encode target labels into two-dimensional semantic tensors for guiding adversarial noise generation. Additionally, we design a novel masking strategy tailored for the training process, ensuring that parts of the generated noise retain complete semantic information about the target class. Extensive experiments demonstrate that TGAF consistently surpasses state-of-the-art methods across various settings.

Rethinking Target Label Conditioning in Adversarial Attacks: A 2D Tensor-Guided Generative Approach

Clean-image backdoor attacks, which use only label manipulation in training datasets to compromise deep neural networks, pose a significant threat to security-critical applications. A critical flaw in existing methods is that the poison rate required for a successful attack induces a proportional, and thus noticeable, drop in Clean Accuracy (CA), undermining their stealthiness. This paper presents a new paradigm for clean-image attacks that minimizes this accuracy degradation by optimizing the trigger itself. We introduce Generative Clean-Image Backdoors (GCB), a framework that uses a conditional InfoGAN to identify naturally occurring image features that can serve as potent and stealthy triggers. By ensuring these triggers are easily separable from benign task-related features, GCB enables a victim model to learn the backdoor from an extremely small set of poisoned examples, resulting in a CA drop of less than 1\%. Our experiments demonstrate GCB's remarkable versatility, successfully adapting to six datasets, five architectures, and four tasks, including the first demonstration of clean-image backdoors in regression and segmentation. GCB also exhibits resilience against most of the existing backdoor defenses.

Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization

Graph-based methods have proven to be effective in capturing relationships among points for 3D point cloud analysis. However, these methods often suffer from suboptimal graph structures, particularly due to sparse connections at boundary points and noisy connections in junction areas. To address these challenges, we propose a novel method that integrates a graph smoothing module with an enhanced local geometry learning module. Specifically, we identify the limitations of conventional graph structures, particularly in handling boundary points and junction areas. In response, we introduce a graph smoothing module designed to optimize the graph structure and minimize the negative impact of unreliable sparse and noisy connections. Based on the optimized graph structure, we improve the feature extract function with local geometry information. These include shape features derived from adaptive geometric descriptors based on eigenvectors and distribution features obtained through cylindrical coordinate transformation. Experimental results on real-world datasets validate the effectiveness of our method in various point cloud learning tasks, i.e., classification, part segmentation, and semantic segmentation.

Graph Smoothing for Enhanced Local Geometry Learning in Point Cloud Analysis

With the rapid and continuous increase in academic publications, identifying high-quality research has become an increasingly pressing challenge. While recent methods leveraging Large Language Models (LLMs) for automated paper evaluation have shown great promise, they are often constrained by outdated domain knowledge and limited reasoning capabilities. In this work, we present \textbf{\textit{PaperEval}}, a novel LLM-based framework for automated paper evaluation that addresses these limitations through two key components: 1) a domain-aware paper retrieval module that retrieves relevant concurrent work to support contextualized assessments of novelty and contributions, and 2) a latent reasoning mechanism that enables deep understanding of complex motivations and methodologies, along with comprehensive comparison against concurrently related work, to support more accurate and reliable evaluation. 
To guide the reasoning process, we introduce a progressive ranking optimization strategy that encourages the LLM to iteratively refine its predictions with an emphasis on relative comparison. 
Experiments on two datasets demonstrate that PaperEval consistently outperforms existing methods in both academic impact and paper quality evaluation. In addition, we deploy PaperEval in a real-world paper recommendation system for filtering high-quality papers, which has gained strong engagement on social media---amassing over 8,000 subscribers and attracting over 10,000 views for many filtered high-quality papers---demonstrating the practical effectiveness of PaperEval.

Navigating Through Paper Flood: Advancing LLM-Based Paper Evaluation Through Domain-Aware Retrieval and Latent Reasoning

Region representation learning plays a pivotal role in urban computing by extracting meaningful features from unlabeled urban data. Analogous to how perceived facial age reflects an individual's health, the visual appearance of a city serves as its "portrait", encapsulating latent socio-economic and environmental characteristics. Recent studies have explored leveraging Large Language Models (LLMs) to incorporate textual knowledge into imagery-based urban region representation learning. However, two major challenges remain: i) difficulty in aligning fine-grained visual features with long captions, and ii) suboptimal knowledge incorporation due to noise in LLM-generated captions. To address these issues, we propose a novel pre-training framework called UrbanLN that improves Urban region representation learning through Long-text awareness and Noise suppression. Specifically, we introduce an information-preserved stretching interpolation strategy that aligns long captions with fine-grained visual semantics in complex urban scenes. To effectively mine knowledge from LLM-generated captions and filter out noise, we propose a dual-level optimization strategy. At the data level, a multi-model collaboration pipeline automatically generates diverse and reliable captions without human intervention. At the model level, we employ a momentum-based self-distillation mechanism to generate stable pseudo-targets, facilitating robust cross-modal learning under noisy conditions. Extensive experiments across four real-world cities and various downstream tasks demonstrate the superior performance of our UrbanLN.

Improving Region Representation Learning from Urban Imagery with Noisy Long-Caption Supervision

The emergence of large language models (LLMs) marks a transformative era in artificial intelligence~(AI). However, systematically evaluating the capability of LLMs is challenging due to the necessity of a large number of labeled test data. To tackle this problem, in the conventional AI field, AutoEval has been proposed to estimate the capability of AI models without data labeling effort. Unfortunately, even though multiple AutoEval methods have been proposed, most are constructed for classification tasks and evaluated only on image datasets. As a result, their effectiveness for LLMs is unclear, as LLMs often target generation tasks. In this work, we introduce the first AutoEval benchmark specifically designed to estimate the capability of LLMs using unlabeled test data, AEBench. Besides existing AutoEval methods, AEBench also supports our designed method, which utilizes the correlation between data uncertainty and model ability for the capability estimation. In total, AEBench covers 12 AutoEval methods and 120 method combinations. Based on AEBench, we conducted a comprehensive study to explore the usefulness of AutoEval on LLMs. Experimental results on 10 datasets demonstrated that our designed uncertainty features-based methods perform the best in achieving the lowest estimation errors.

On the Evaluation of Capability Estimation Methods for Large Language Models

In recent years, face anti-spoofing (FAS) has made notable progress in multimodal fusion, cross-domain generalization, and interpretability. With the development of large language models and reinforcement learning (RL), strategy-based training paradigms offer new opportunities for jointly modeling multimodality, generalization, and interpretability. 
However, compared to unimodal reasoning, multimodal reasoning introduces more complex logic, such as accurate feature representation and cross-modal verification, which significantly increases inference complexity and labeling difficulty. Due to the lack of high-quality annotations in existing multimodal FAS datasets, directly applying RL strategies is sub-optimal, hindering robust multimodal reasoning. In this paper, we find two key issues of supervised fine-tuning combined with reinforcement learning (SFT+RL) paradigms in multimodal FAS inference: 1) limited multimodal reasoning paths not only hinder the full utilization of multimodal information but also constrain the model’s exploration space after SFT, thereby affecting the effectiveness of subsequent RL; and 2) the mismatch between single-task supervision and the diversity of multimodal reasoning paths leads to reasoning confusion, where models may exploit shortcuts by directly mapping input images to answers, bypassing the intended reasoning process. These issues further increase the complexity of multimodal reasoning and hinder the effective application of RL strategies.
To address these challenges, we propose the PA-FAS framework with a reasoning path enhancement strategy for high-quality extended reasoning sequences construction based on limited annotated data to enrich the reasoning paths and alleviate exploration constraints. Additionally, we introduce an answer shuffling mechanism during SFT for comprehensive multimodal analysis rather than mining superficial cues, thus encouraging deeper reasoning and avoiding shortcut learning. 
Our method significantly improves multimodal reasoning accuracy and generalization, and successfully unifies multimodal fusion, cross-domain generalization, and interpretability towards trustworthy multimodal FAS.

PA-FAS: Towards Interpretable and Generalizable Multimodal Face Anti-Spoofing via Path-Augmented Reinforcement Learning

While Large Language Models (LLMs) demonstrate immense potential for automating integrated circuit (IC) development, their practical deployment is fundamentally limited by restricted context windows. Existing context-extension methods struggle to achieve effective semantic modeling and thorough multi-hop reasoning over extensive, intricate circuit specifications. To address this, we introduce \textbf{ChipMind}, a novel knowledge graph-augmented reasoning framework specifically designed for lengthy IC specifications. ChipMind first transforms circuit specifications into a domain-specific knowledge graph (\textit{ChipKG}) through the \textit{Circuit Semantic-Aware Knowledge Graph Construction} methodology. It then leverages the \textit{ChipKG-Augmented Reasoning} mechanism, combining information-theoretic adaptive retrieval to dynamically trace logical dependencies with intent-aware semantic filtering to prune irrelevant noise, effectively balancing retrieval completeness and precision. Evaluated on an industrial-scale specification reasoning benchmark, ChipMind significantly outperforms state-of-the-art baselines, achieving an average improvement of \textbf{34.59\%} (up to \textbf{72.73\%}). Our framework bridges a critical gap between academic research and practical industrial deployment of LLM-aided Hardware Design (LAD).

ChipMind: Retrieval-Augmented Reasoning for Long-Context Circuit Design Specifications

The goal of fair clustering is to find clusters such that the proportion of sensitive attributes (e.g., gender, race, etc) in each cluster is similar to the proportion of the entire data. Various fair clustering algorithms have been proposed, which modify standard $K$-means clustering to satisfy a given fairness constraint. A critical limitation of several existing fair clustering algorithms is that the number of parameters to be learned is proportional to the sample size because the cluster assignment of each datum should be optimized simultaneously with the cluster center, and thus scaling up the algorithms is difficult. In this paper, we propose a new fair clustering algorithm based on finite mixture model called Fair Model-based Clustering (FMC). A main advantage of FMC is that the number of learnable parameters is independent to the sample size and thus can be scaled up easily. In particular, a mini-batch learning is possible to obtain clusters that are approximately fair. Moreover, FMC can be applied to non-metric data (e.g., categorical data) as long as the likelihood is well-defined. Theoretical and empirical justifications of the superiority of the proposed algorithm are provided.

Downloads

Next from AAAI 2026

ParaDySe: A Parallel Strategy Switching Framework for Dynamic Sequences in Transformer-based Large Language Models

Stay up to date with the latest Underline news!

PRESENTATIONS

CONFERENCES

COMPANY

RESOURCES