Video captioning aims to generate comprehensive and coherent descriptions of video content, contributing to the advancement of both video understanding and video generation. However, existing video captioning methods often suffer from motion-detail imbalance: models tend to overemphasize one aspect while neglecting the other. To address this issue, we propose solutions from two aspects. 1) Data: we construct the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline of Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). Compared with previous video captioning datasets, HMD-270K features longer captions with more balanced and comprehensive motion-detail descriptions, directly mitigating the motion-detail imbalance problem. 2) Optimization: we introduce the Caption Set Equivalence Reward (CSER), built on GRPO, which employs a subset-to-set matching and bidirectional validation strategy. Compared with previous video captioning rewards, CSER optimizes the completeness and correctness of captions at a finer granularity. Post-training on HMD-270K with CSER yields OwlCap, a video captioning multi-modal large language model (MLLM) with balanced motion-detail capabilities. Experimental results demonstrate that OwlCap achieves significant improvements over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1 score). Experiments on the downstream text-to-video (T2V) task further confirm OwlCap's superior video captioning capability. The HMD-270K dataset and the OwlCap model will be publicly released to facilitate advancements in the video captioning research community.
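The abstract does not spell out the CSER formula, but the subset-to-set matching and bidirectional validation it mentions can be read as a precision/recall-style score over atomic caption units. The sketch below is a minimal, hypothetical illustration of that reading, not the paper's implementation: the `caption_set_reward` function and the `matches` judge are assumptions introduced here for clarity. A scalar produced this way could serve as the per-sample reward inside a GRPO-style optimization loop.

```python
from typing import Callable, List


def caption_set_reward(
    generated_units: List[str],
    reference_units: List[str],
    matches: Callable[[str, List[str]], bool],
) -> float:
    """Illustrative precision/recall-style reward over caption units.

    `matches(unit, unit_set)` stands in for a judge (e.g., an NLI- or
    LLM-based checker) that returns True if `unit` is supported by any
    unit in `unit_set`. This is an assumption for illustration; the
    actual CSER formulation may differ.
    """
    if not generated_units or not reference_units:
        return 0.0
    # Correctness: fraction of generated units supported by the reference set.
    correct = sum(matches(u, reference_units) for u in generated_units)
    precision = correct / len(generated_units)
    # Completeness: fraction of reference units covered by the generated caption.
    covered = sum(matches(u, generated_units) for u in reference_units)
    recall = covered / len(reference_units)
    # Bidirectional validation: combine both directions (F1-style harmonic mean).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Trivial exact-match judge as a placeholder for a learned checker.
    judge = lambda unit, unit_set: unit in unit_set
    gen = ["a dog runs across the lawn", "the dog is brown"]
    ref = ["a dog runs across the lawn", "a child throws a ball"]
    print(caption_set_reward(gen, ref, judge))  # 0.5
```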