Aiming to estimate the full extent of partially occluded objects, amodal segmentation is a critical capability for visual intelligence. Existing methods suffer from limited efficiency and precision due to their reliance on auxiliary information or two-stage architectures, and their poor generalizability fails to meet practical requirements. To overcome these challenges, we propose a new paradigm, CondDiff-AMO, that interprets amodal segmentation as a denoising problem by leveraging diffusion models. Methodologically, the framework comprises three key innovations that adapt to the characteristics of the task and unlock the potential of diffusion models for amodal segmentation: a masking strategy in the forward process, an adaptive transformer for conditional feature extraction, and visual-guided sampling. In the forward process, a progressive masking strategy converts ground-truth masks into visible masks, simulating the amodal segmentation process to strengthen reasoning about occluded areas. Architecturally, a pyramid network with feature refinement extracts adaptive and representative conditional priors, improving guidance during the denoising process of the diffusion model. In the sampling stage, the visible mask is incorporated with an ensemble strategy, constraining predictions in the occluded regions. Experiments on five well-known datasets under both supervised and zero-shot settings confirm that CondDiff-AMO outperforms state-of-the-art methods.
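The forward-process masking described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the linear blending schedule, and the toy masks are all assumptions introduced here purely to show the idea of progressively degrading a ground-truth amodal mask toward the visible mask, so that the reverse (denoising) direction must recover the occluded region.

```python
import numpy as np

def progressive_mask(amodal_mask, visible_mask, t, T):
    """Illustrative progressive masking for a forward diffusion process.

    As the timestep t grows from 0 to T, the ground-truth amodal mask
    is blended toward the visible mask; the reverse process therefore
    learns to reconstruct the occluded portion. The linear schedule is
    an assumption, not the schedule used by CondDiff-AMO.
    """
    alpha = t / T  # masking ratio increases with the timestep
    return (1.0 - alpha) * amodal_mask + alpha * visible_mask

# Toy 1-D "masks": the object spans cells 0..5, but cells 3..5 are occluded.
amodal = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=float)
visible = np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float)

x_start = progressive_mask(amodal, visible, t=0, T=10)  # full amodal mask
x_end = progressive_mask(amodal, visible, t=10, T=10)   # visible mask only
```

Running the reverse of this interpolation, starting from the visible mask, is what makes occlusion completion a denoising problem in this formulation.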
