Video-based visible-infrared person re-identification (VVI-ReID) aims to match pedestrian video sequences captured across different modalities and viewpoints, and plays a critical role in all-day intelligent surveillance. While recent supervised methods have shown promising results, they rely on large-scale cross-modal video annotations, which are expensive and difficult to obtain in practice. To address this limitation, we introduce the task of unsupervised domain adaptation for video-based visible-infrared person re-identification (UDA-VVI-ReID), in which a model is transferred from a labeled source domain to an unlabeled target domain. This setting presents unique challenges, including modality discrepancies, temporal variations, and the difficulty of generating reliable pseudo-labels under occlusion or motion noise. To tackle these issues, we propose a Dynamic-Static Collaboration (DSC) framework that integrates two key modules. The Dynamic-Static Label Unification (DSLU) module refines pseudo-labels by enforcing consistency between appearance and motion features across modalities. The Dynamic-Static Joint Learning (DSJL) module further enhances representation learning through contrastive objectives and neighbor-based feature alignment guided by both dynamic and static cues. Experimental results on the HITSZ-VCM and BUPTCampus datasets demonstrate that the proposed framework achieves state-of-the-art performance among unsupervised approaches, without relying on any target-domain labels.
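To make the pseudo-label unification idea concrete, the sketch below clusters static (appearance) and dynamic (motion) tracklet embeddings separately, aligns the two clusterings, and keeps only tracklets whose two assignments agree. This is a minimal illustration under stated assumptions, not the paper's actual DSLU implementation: the use of DBSCAN with a cosine metric, the `eps`/`min_samples` values, and the Hungarian-matching step are all illustrative choices introduced here.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.optimize import linear_sum_assignment

def unify_pseudo_labels(static_feats, dynamic_feats, eps=0.6, min_samples=4):
    """Keep a tracklet's pseudo-label only when its static (appearance) and
    dynamic (motion) cluster assignments agree after matching the two
    clusterings; all other tracklets are marked unreliable (-1)."""
    s = DBSCAN(eps=eps, min_samples=min_samples,
               metric="cosine").fit_predict(static_feats)
    d = DBSCAN(eps=eps, min_samples=min_samples,
               metric="cosine").fit_predict(dynamic_feats)
    valid = (s >= 0) & (d >= 0)                # drop DBSCAN outliers (-1)
    n_s, n_d = s.max() + 1, d.max() + 1
    cooc = np.zeros((n_s, n_d))                # cluster-id co-occurrence counts
    for i in np.flatnonzero(valid):
        cooc[s[i], d[i]] += 1
    rows, cols = linear_sum_assignment(-cooc)  # Hungarian match: maximize overlap
    d_to_s = dict(zip(cols, rows))             # matched dynamic -> static cluster id
    labels = np.full(len(s), -1)
    for i in np.flatnonzero(valid):
        if d_to_s.get(d[i]) == s[i]:           # both views agree -> keep label
            labels[i] = s[i]
    return labels
```

In this reading, disagreement between the appearance and motion views flags tracklets whose pseudo-labels are likely corrupted by occlusion or motion noise, matching the reliability concern raised in the abstract; the surviving labels could then supervise contrastive objectives of the kind the DSJL module describes.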