In this paper, we address a challenging task, synchronous motion captioning, which aims to generate a language description synchronized with human motion sequences. This task pertains to numerous applications, such as aligned sign language transcription, unsupervised action segmentation, and temporal grounding. Our method introduces mechanisms to control the self- and cross-attention distributions of the Transformer, enabling interpretability and time-aligned text generation. We achieve this through masking strategies and structuring losses that push the model to maximize attention only on the most important frames contributing to the generation of a motion word. These constraints aim to prevent undesired mixing of information in attention maps and to provide a monotonic attention distribution across tokens. The cross-attentions of tokens are then used for progressive text generation in synchronization with human motion sequences. We demonstrate the superior performance of our approach through evaluation on the two available benchmark datasets, KIT-ML and HumanML3D. As visual evaluation is essential for this task, we provide a comprehensive set of animated visual illustrations of the output of synchronous text generation in the supplementary material. The code will be made publicly available.
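As an illustration of the kind of attention-structuring constraints described above, the sketch below shows two generic auxiliary losses on a cross-attention map: a concentration term that pushes each generated word's attention onto a small set of motion frames, and a monotonicity term that discourages the attention centroid from moving backward in time across consecutive tokens. This is a minimal, hypothetical sketch assuming a PyTorch setup and an attention tensor of shape (batch, tokens, frames); the function name, loss formulations, and weights are illustrative assumptions, not the paper's exact losses.

```python
import torch

def attention_structuring_losses(cross_attn):
    """
    Illustrative (hypothetical) structuring losses on a cross-attention map.

    cross_attn: (batch, num_tokens, num_frames) tensor whose rows sum to 1,
                i.e., the attention of each generated text token over motion frames.
    Returns a concentration loss (low entropy => attention peaks on few frames)
    and a monotonicity loss (attention centroids should not move backward in time).
    """
    eps = 1e-8

    # Concentration: minimize per-token entropy so each word attends
    # mainly to the few frames that contribute to generating it.
    entropy = -(cross_attn * (cross_attn + eps).log()).sum(dim=-1)  # (B, T)
    concentration_loss = entropy.mean()

    # Monotonicity: the expected frame index (centroid) per token should be
    # non-decreasing across consecutive tokens.
    num_frames = cross_attn.size(-1)
    frame_idx = torch.arange(num_frames, dtype=cross_attn.dtype,
                             device=cross_attn.device)
    centroids = (cross_attn * frame_idx).sum(dim=-1)               # (B, T)
    backward_steps = (centroids[:, :-1] - centroids[:, 1:]).clamp(min=0)
    monotonicity_loss = backward_steps.mean()

    return concentration_loss, monotonicity_loss


# Usage sketch: add the terms to the captioning loss with small weights.
attn = torch.softmax(torch.randn(2, 8, 50), dim=-1)   # dummy cross-attention
l_conc, l_mono = attention_structuring_losses(attn)
total_aux = 0.1 * l_conc + 0.1 * l_mono                # weights are placeholders
```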