Singapore

Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, limiting practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework addressing these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), caching static features to bypass most model layers at inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers to bring extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.
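The caching idea in the abstract can be illustrated with a toy sketch. Everything below is hypothetical: `CachedDenoiser`, `static_features`, `light_head`, and the scalar arithmetic are placeholders chosen for illustration, not the authors' LightningCP implementation. The sketch only shows the general pattern of computing features that are assumed constant across denoising steps once, then reusing them so that each step runs only a lightweight head.

```python
from typing import Dict, List


class CachedDenoiser:
    """Toy denoiser that reuses static features across steps (illustrative only)."""

    def __init__(self) -> None:
        self._cache: Dict[str, float] = {}
        self.heavy_calls = 0  # how many times the expensive branch actually ran

    def static_features(self, reference_id: str, reference: List[float]) -> float:
        # Stand-in for the deep layers whose output is assumed constant across steps.
        if reference_id not in self._cache:
            self.heavy_calls += 1
            self._cache[reference_id] = sum(reference) / len(reference)
        return self._cache[reference_id]

    def light_head(self, latent: float, feat: float, t: float) -> float:
        # Stand-in for the few layers that still run at every step.
        return t * (latent - feat)

    def predict_noise(self, latent: float, reference_id: str,
                      reference: List[float], t: float) -> float:
        feat = self.static_features(reference_id, reference)
        return self.light_head(latent, feat, t)


def run_steps(model: CachedDenoiser, latents: List[float], timesteps: List[float],
              reference_id: str, reference: List[float]) -> List[float]:
    # With cached features, each prediction depends only on its own (latent, t)
    # pair, so this loop could be batched instead of run one step after another.
    return [model.predict_noise(x, reference_id, reference, t)
            for x, t in zip(latents, timesteps)]


model = CachedDenoiser()
preds = run_steps(model, [1.0, 0.8, 0.6], [1.0, 0.5, 0.25], "ref0", [0.2, 0.4])
```

After the three steps above, `model.heavy_calls` is 1: the expensive branch ran once and the remaining steps hit the cache.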

AAAI 2026

Lightning Fast Caching-based Parallel Denoising Prediction for Accelerating Talking Head Generation

computer vision (cv)

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, and exhibit programs, and a range of other activities to be announced.




Predicting single-cell perturbation outcomes directly advances gene function analysis and facilitates drug candidate selection, making it a key driver of both basic and translational biomedical research. However, a major bottleneck in this task is the unpaired nature of single-cell data, as the same cell cannot be observed both before and after perturbation due to the destructive nature of sequencing. Although some neural generative transport models attempt to tackle unpaired single-cell perturbation data, they either lack explicit conditioning or depend on prior spaces for indirect distribution alignment, limiting precise perturbation modeling. In this work, we approximate the Schrödinger Bridge (SB), which defines stochastic dynamic mappings recovering the entropy-regularized optimal transport (OT), to directly align the distributions of control and perturbed single-cell populations across different perturbation conditions. Unlike prior SB approximations that rely on bidirectional modeling to infer optimal node couplings, we leverage Minibatch-OT based node-level coupling to avoid such bidirectional inference and the associated ill-posedness of defining the reverse process. This coupling directly guides bridge learning, yielding a scalable approximation to the SB. We approximate two SBs, one modeling discrete gene activation states and the other modeling continuous expression distributions. Joint training enables accurate perturbation modeling and captures single-cell heterogeneity. Experiments on public genetic and drug perturbation datasets show that our model effectively captures heterogeneous single-cell responses and achieves state-of-the-art performance.

Departures: Distributional Transport for Single-Cell Perturbation Prediction with Neural Schrödinger Bridges
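The node-level coupling mentioned in the Departures abstract can be illustrated with a small sketch of minibatch OT. This is an assumption-laden toy, not the authors' code: it uses a squared Euclidean cost, equal-sized batches with uniform weights, and exact (unregularized) OT solved by brute force over permutations, which is only feasible for very small batches. The function names `squared_distance` and `minibatch_ot_pairs` are hypothetical. With uniform weights and equal batch sizes, an optimal transport plan can always be taken to be a one-to-one pairing, which is why searching over permutations suffices here.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minibatch_ot_pairs(control: Sequence[Sequence[float]],
                       perturbed: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return the one-to-one pairing (control index, perturbed index) with
    minimal total squared distance. Brute force: tiny batches only."""
    if len(control) != len(perturbed):
        raise ValueError("this sketch assumes equal-sized minibatches")
    n = len(control)

    def total_cost(perm: Tuple[int, ...]) -> float:
        return sum(squared_distance(control[i], perturbed[j])
                   for i, j in enumerate(perm))

    best_perm = min(permutations(range(n)), key=total_cost)
    return list(enumerate(best_perm))


# Unpaired toy "cells": rows are expression vectors.
control = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
perturbed = [[5.5, 5.0], [0.5, 0.0], [1.0, 2.0]]
pairs = minibatch_ot_pairs(control, perturbed)
```

For this batch the pairing is `[(0, 1), (1, 2), (2, 0)]`, matching each control cell to its nearby perturbed cell. A practical implementation would replace the brute-force search with a linear assignment or Sinkhorn solver.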

Spiking Neural Networks (SNNs) promise significant energy efficiency by processing information via sparse, event-driven spikes. However, realizing this potential is hindered by the conventional use of a rigid, uniform timestep, $T$. This constraint imposes a challenging trade-off between accuracy and latency, while also incurring the prohibitive training costs of Backpropagation Through Time (BPTT). To overcome this limitation, we introduce the Pseudo-Spiking Neuron (PseudoSN), a novel training proxy that conceptualizes latency as an intrinsic, learnable parameter for each neuron. Building on the efficiency of rate-based methods, the PseudoSN models temporal dynamics in a single, BPTT-free pass. It employs a learnable probabilistic noise scheme to emulate the discretization effects of spike generation (e.g., clipping and quantization), making the neuron-specific timestep—and thus latency—directly optimizable via backpropagation. Integrated into a hardware-aware objective, our framework trains heterogeneous-latency SNNs that autonomously learn to optimize the trade-offs among accuracy, latency and energy, establishing a new state-of-the-art on major benchmarks.

Pseudo-Spiking Neurons: A Noise-Based Training Framework for Heterogeneous-Latency Spiking Neural Networks

Multi-modal object Re-Identification (ReID) aims to aggregate complementary information from different modalities to retrieve specific objects. Existing methods often rely on hard token filtering or simple fusion strategies, which can lead to the loss of discriminative cues and increased background interference. To address these challenges, we propose STMI, a novel learning framework composed of three key components: (1) a Segmentation-Guided Feature Modulation (SFM) module that leverages SAM-generated masks to enhance foreground representations and suppress background noise through learnable attention modulation; (2) a Semantic Token Reallocation (STR) module that employs learnable query tokens and an adaptive reallocation mechanism to extract compact and informative representations without discarding any tokens; (3) a Cross-Modal Hypergraph Interaction (CHI) module that constructs a unified hypergraph across modalities to capture high-order semantic relationships. Extensive experiments on public datasets (i.e., RGBNT201, RGBNT100, and MSVR310) demonstrate the effectiveness and robustness of our proposed STMI framework in multi-modal ReID scenarios. The source code is available at https://github.com/young6man/STMI.

STMI: Segmentation-Guided Token Modulation with Cross-Modal Hypergraph Interaction for Multi-Modal Object Re-Identification

Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos containing relevant moments for a given text query. This task is extremely challenging, as untrimmed videos often include numerous actions and objects unrelated to the query. However, existing methods usually struggle with fine-grained action-object modeling, limiting their retrieval performance. To tackle this challenge, we introduce Action-and-object Aware Alignment for Partially Relevant Video Retrieval (A$^3$PRVR), a dual-branch framework designed to enhance retrieval by improving the modeling of action-object relationships. Specifically, we propose a Query-specific Deformable Temporal Attention (Q-DTA) module to effectively capture action-relevant object information in video features, while filtering out irrelevant content. Additionally, we propose an action-and-object aware alignment module to enable fine-grained textual understanding and video-text alignment. It uses action- and object-aware contrastive losses to enhance the model's sensitivity to action-object distinctions in the text query. Compared to state-of-the-art methods, A$^3$PRVR achieves an average relative gain of 6.5% in SumR across the Charades-STA, ActivityNet-Caption, and TVR datasets.

Action-and-object Aware Alignment for Partially Relevant Video Retrieval

3D Visual Grounding (3DVG) aims to localize the referent of natural language referring expressions through two core tasks: Referring Expression Comprehension (3DREC) and Segmentation (3DRES). While existing methods achieve high accuracy in simple, single-object scenes, they suffer from severe performance degradation in complex, multi-object scenes—common in real-world settings, hindering practical deployment. Existing methods face two key challenges in complex, multi-object scenes: inadequate parsing of implicit localization cues critical for disambiguating visually similar objects, and ineffective suppression of dynamic spatial interference from co-occurring objects, resulting in degraded grounding accuracy. To address these challenges, we propose PC-CrossDiff, a unified dual-task framework with a dual-level cross-modal differential attention architecture for 3DREC and 3DRES. Specifically, the framework introduces: (i) Point-Level Differential Attention (PLDA) modules that apply bidirectional differential attention between text and point clouds, adaptively extracting implicit localization cues via learnable weights to improve discriminative representation; (ii) Cluster-Level Differential Attention (CLDA) modules that establish a hierarchical attention mechanism to adaptively enhance localization-relevant spatial relationships while suppressing ambiguous or irrelevant spatial relations through a localization-aware differential attention block. To address the scale disparity and conflicting gradients in joint 3DREC–3DRES training, we propose $\mathcal{L}_{\text{DGTL}}$, a unified loss function that explicitly reduces multi-task crosstalk and enables effective parameter sharing across tasks. Our method achieves state-of-the-art performance on the ScanRefer, NR3D, and SR3D benchmarks. 
Notably, on the Implicit subsets of ScanRefer, it improves the Overall@0.50 score by +10.16% for the 3DREC task, highlighting its strong ability to parse implicit spatial cues.

PC-CrossDiff: Point-Cluster Dual-Level Cross-Modal Differential Attention for Unified 3D Referring and Segmentation

Diffusion probabilistic models have set a new standard for generative fidelity but are hindered by a slow iterative sampling process. A powerful training-free strategy to accelerate this process is Schedule Optimization, which aims to find an optimal distribution of timesteps for a fixed and small Number of Function Evaluations (NFE) to maximize sample quality. To this end, a successful schedule optimization method must adhere to four core principles: effectiveness, adaptivity, practical robustness, and computational efficiency. However, existing paradigms struggle to satisfy these principles simultaneously, motivating the need for a more advanced solution. To overcome these limitations, we propose the Hierarchical-Schedule-Optimizer (HSO), a novel and efficient bi-level optimization framework. HSO reframes the search for a globally optimal schedule into a more tractable problem by iteratively alternating between two synergistic levels: an upper-level global search for an optimal initialization strategy and a lower-level local optimization for schedule refinement. This process is guided by two key innovations: the Midpoint Error Proxy (MEP), a solver-agnostic and numerically stable objective for effective local optimization, and the Spacing-Penalized Fitness (SPF) function, which ensures practical robustness by penalizing pathologically close timesteps. Extensive experiments show that HSO sets a new state-of-the-art for training-free sampling in the extremely low-NFE regime. For instance, with an NFE of just 5, HSO achieves a remarkable FID of 11.94 on LAION-Aesthetics with Stable Diffusion v2.1. Crucially, this level of performance is attained not through costly retraining, but with a one-time optimization cost of less than 8 seconds, presenting a highly practical and efficient paradigm for diffusion model acceleration.

Hierarchical Schedule Optimization for Fast and Robust Diffusion Model Sampling
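The HSO abstract describes the Spacing-Penalized Fitness (SPF) only as penalizing pathologically close timesteps and does not give its form. The sketch below is therefore a guess at the general shape of such a term, not the paper's definition: a hinge penalty on adjacent gaps smaller than a threshold, added to an error score where lower is better. The names `spacing_penalty` and `penalized_fitness`, and the parameters `min_gap` and `weight`, are all hypothetical.

```python
from typing import Sequence


def spacing_penalty(timesteps: Sequence[float], min_gap: float = 0.05) -> float:
    """Sum of hinge penalties for adjacent timesteps closer than min_gap."""
    ordered = sorted(timesteps, reverse=True)  # sampling runs from high t to low t
    gaps = [a - b for a, b in zip(ordered, ordered[1:])]
    return sum(max(0.0, min_gap - gap) for gap in gaps)


def penalized_fitness(error: float, timesteps: Sequence[float],
                      min_gap: float = 0.05, weight: float = 10.0) -> float:
    """Lower is better: a schedule's error plus a weighted spacing penalty."""
    return error + weight * spacing_penalty(timesteps, min_gap)


# Two 5-step schedules (NFE = 5) on a normalized time axis.
well_spaced = [1.0, 0.75, 0.5, 0.25, 0.0]
clustered = [1.0, 0.99, 0.5, 0.25, 0.0]

fit_well = penalized_fitness(0.30, well_spaced)
fit_clustered = penalized_fitness(0.30, clustered)
```

With the same placeholder error of 0.30, the clustered schedule scores worse than the well-spaced one because its first two timesteps are only 0.01 apart.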

User behavior sequences in modern recommendation systems exhibit significant length heterogeneity, ranging from sparse short-term interactions to rich long-term histories. While longer sequences provide more context, we observe that increasing the maximum input sequence length in existing CTR models paradoxically degrades performance for short-sequence users due to attention polarization and length imbalance in training data. To address this, we propose LAIN (Length-Aware Interest Network), a plug-and-play framework that explicitly incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. LAIN consists of three lightweight components: a Spectral Length Encoder that maps length into continuous representations, Length-Conditioned Prompting that injects global contextual cues into both long- and short-term behavior branches, and Length-Modulated Attention that adaptively adjusts attention sharpness based on sequence length. Extensive experiments on three real-world benchmarks and five strong CTR backbones show that LAIN consistently improves overall performance, achieving up to +1.15% AUC gain and 1.63% log loss reduction. Notably, our method significantly improves accuracy for short-sequence users without sacrificing long-sequence effectiveness. Our contributions offer a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.

Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction

Multi-graph multi-label learning (MGML) represents each object as a bag-of-graphs with multiple labels, but demands large-scale labeled data whose acquisition is often difficult and costly. Self-supervised contrastive learning (SCL) mitigates label dependence by leveraging data augmentation to construct discriminative pretext tasks, proving effective for multi-instance learning. However, when applied to MGML, SCL faces two key challenges: (1) it distinguishes individual instances by their differences, whereas MGML requires modeling label correlations; (2) it assumes semantic invariance under augmentation, but structural perturbations in MGML alter label semantics. To tackle these challenges, we propose a self-suPervised contrastive rE-learning framework for mulTi-grAph multi-labeL classification (PETAL). Specifically, to model label correlations, we first define a unified label space to learn label prototypes and align features with them, yielding prototype-aligned representations. We then design a multi-granularity contrastive loss over these representations, which captures label dependencies by contrasting at the bag level, graph level, and bag-graph level. Moreover, to ensure semantic invariance, we develop a contrastive re-learning strategy based on prototype-aligned representations to generate augmentation-free positive samples. This guarantees consistent multi-label distributions without structural perturbations. Experiments on six datasets demonstrate that PETAL achieves an average improvement of 4.12% over state-of-the-art self-supervised and supervised baselines.

Self-Supervised Contrastive Re-Learning for Multi-Graph Multi-Label Classification

Safe Reinforcement Learning (RL) often faces significant issues such as constraint violations and instability, necessitating the use of constrained policy optimization, which seeks optimal policies while ensuring adherence to specific constraints like safety. Typically, constrained optimization problems are addressed by the Lagrangian method, a post-violation remedial approach that may result in oscillations and overshoots. Motivated by this, we propose a novel method named Proactive Constrained Policy Optimization (PCPO) that incorporates a preemptive penalty mechanism. This mechanism integrates barrier items into the objective function as the policy nears the boundary, imposing a cost. Meanwhile, we introduce a constraint-aware intrinsic reward to guide boundary-aware exploration, which is activated only when the policy approaches the constraint boundary. We establish theoretical upper and lower bounds for the duality gap and the performance of the PCPO update, shedding light on the method's convergence characteristics. Additionally, to enhance the optimization performance, we adopt a policy iteration approach. An interesting finding is that PCPO demonstrates significant stability in experiments. Experimental results indicate that the PCPO framework provides a robust solution for policy optimization under constraints, with important implications for future research and practical applications.

Proactive Constrained Policy Optimization with Preemptive Penalty
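The preemptive penalty in the PCPO abstract is described as a barrier that imposes a cost as the policy nears the constraint boundary. The sketch below illustrates that behavior with a normalized log barrier that is zero far from the boundary and grows without bound at the limit. The specific functional form, the name `preemptive_penalty`, and the `margin` parameter are assumptions for illustration, not the paper's formulation. It assumes `limit > 0` and `0 < margin < 1`.

```python
import math


def preemptive_penalty(cost: float, limit: float, margin: float = 0.2) -> float:
    """Zero while the expected cost is far from the limit; log-barrier growth
    once the cost enters the band [(1 - margin) * limit, limit)."""
    threshold = (1.0 - margin) * limit
    if cost <= threshold:
        return 0.0  # far from the boundary: no penalty
    if cost >= limit:
        return math.inf  # at or past the boundary: infeasible
    # Remaining slack as a fraction of the band width: 1 at the threshold
    # (penalty 0) and approaching 0 at the limit (penalty -> infinity).
    slack = (limit - cost) / (limit - threshold)
    return -math.log(slack)


def penalized_objective(reward: float, cost: float, limit: float,
                        weight: float = 1.0) -> float:
    """Objective to maximize: reward minus the weighted barrier penalty."""
    return reward - weight * preemptive_penalty(cost, limit)


safe = preemptive_penalty(5.0, 10.0)     # well inside the feasible region
near = preemptive_penalty(9.0, 10.0)     # inside the margin band
closer = preemptive_penalty(9.9, 10.0)   # almost at the boundary
```

Unlike a Lagrangian multiplier that reacts after a violation has occurred, this kind of term starts pushing back before the limit is reached, which is the contrast the abstract draws.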

In this paper, we propose MoEG-HOI, a novel method for the challenging 3D hand-object interaction (HOI) motion generation task that introduces Mixture-of-Experts (MoE) to this field for the first time. Almost all mainstream approaches to HOI motion generation use diffusion models for their strong generative ability. Nevertheless, because HOI is fine-grained, training a diffusion model well in a single stage is not trivial. Existing state-of-the-art (SOTA) methods (e.g., Text2HOI and MF-MDM) alleviate this mainly via a coarse-to-fine, multi-stage paradigm. Although effective and practical, this paradigm prevents end-to-end training for optimal performance. In contrast, MoEG-HOI applies MoE to address this in a single stage, with end-to-end training. This allows each expert to specialize in distinct HOI patterns, which eases the training difficulty of each individual expert. However, applying MoE naively is not optimal for two reasons: (1) in expert design, the original MoE cannot explicitly characterize the hand’s articulated structure at the hand, finger, and joint levels, and (2) in expert routing, the characteristics of varying HOI action classes and diffusion noise levels have not been considered. To address the first problem, the experts are organized into groups that correspond to motion generation for the hand, fingers, and joints respectively, under semantic guidance from global to local. To facilitate this, the HOI text description is correspondingly refined at the Hand-Finger-Joint levels using an LLM. To address the second, MoE routing uses the HOI action label and the diffusion noise level jointly to select experts, to better reflect inter-class variation among actions and the dynamics of diffusion generation. SOTA performance on the ARCTIC, GRAB, and H2O datasets demonstrates the effectiveness of our method.
