Singapore

The introduction of diffusion models has brought significant advances to the field of audio-driven talking head generation. However, extremely slow inference severely limits the practical deployment of diffusion-based talking head generation models. In this study, we propose READ, the first real-time diffusion-transformer-based talking head generation framework. Our approach first learns a highly compressed spatiotemporal video latent space via a temporal VAE, significantly reducing the token count to accelerate generation. To achieve better audio-visual alignment within this compressed latent space, a pre-trained Speech Autoencoder (SpeechAE) is proposed to generate temporally compressed speech latent codes corresponding to the video latent space. These latent representations are then modeled by a carefully designed Audio-to-Video Diffusion Transformer (A2V-DiT) backbone for efficient talking head synthesis. Furthermore, to ensure temporal consistency and accelerated inference in extended generation, we propose a novel asynchronous noise scheduler (ANS) for both the training and inference processes of our framework. The ANS leverages asynchronous add-noise and asynchronous motion-guided generation in the latent space, ensuring consistency across generated video clips. Experimental results demonstrate that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime, achieving an optimal balance between quality and speed while maintaining robust metric stability in long-duration generation.

AAAI 2026

READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation

gesture & pose

face

biometrics

poster

We are pleased to announce the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), which will be held in Singapore EXPO from January 20 to January 27, 2026.

The purpose of the AAAI conference series is to promote research in Artificial Intelligence (AI) and foster scientific exchange between researchers, practitioners, scientists, students, and engineers across the entirety of AI and its affiliated disciplines. AAAI-26 will feature technical paper presentations, special tracks, invited speakers, workshops, tutorials, poster sessions, senior member presentations, competitions, exhibit programs, and a range of other activities to be announced.



Translating non-invasive signals such as photoplethysmography (PPG) and ballistocardiography (BCG) into clinically meaningful signals like arterial blood pressure (ABP) is vital for continuous, low-cost healthcare monitoring. However, temporal misalignment in multimodal signal transformation impairs waveform accuracy, especially in capturing critical features like ABP peaks. Conventional synchronization methods often rely on strict similarity assumptions or manual tuning, while existing Learning with Noisy Labels (LNL) approaches are ineffective against time-shifted supervision, either discarding excessive data or failing to correct label shifts. To address this, we propose ShiftSyncNet, a meta-learning-based bi-level optimization framework designed to automatically mitigate performance degradation due to time misalignment. It comprises a transformation network (TransNet) and a time-shift correction network (SyncNet), which learns time offsets between training pairs and applies Fourier phase shifts to align supervision signals. Experiments on one real-world industrial dataset and two public datasets show that ShiftSyncNet outperforms baselines by 9.4%, 6.0%, and 12.8%, respectively. Results highlight its effectiveness in correcting time shifts, improving label quality, and enhancing transformation accuracy under diverse misalignments, attributable to SyncNet's label correction capabilities within the meta-learning framework.

Lost in Time? A Meta-Learning Framework for Time-Shift-Tolerant Physiological Signal Transformation
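
The Fourier phase shift mentioned in the abstract above can be illustrated with a minimal sketch. It relies on the shift theorem: multiplying frequency bin k by exp(-2πi·k·d/n) delays a length-n signal by d samples (circularly). The function names `dft`, `idft`, and `fourier_shift` are hypothetical, the offset is passed in by hand rather than learned by SyncNet, and a naive O(n²) transform is used only to keep the example free of dependencies; this is not the authors' implementation.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive O(n^2) inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fourier_shift(x, delay):
    """Circularly delay x by `delay` samples (assumed offset; may be fractional)."""
    n = len(x)
    X = dft(x)
    shifted = []
    for k in range(n):
        # Use signed frequencies so fractional delays stay (nearly) real-valued.
        freq = k if k <= n // 2 else k - n
        shifted.append(X[k] * cmath.exp(-2j * math.pi * freq * delay / n))
    return [v.real for v in idft(shifted)]


# A label that lags its input by 2 samples is realigned with delay=-2.
label = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
aligned = fourier_shift(label, -2)
```

Because the shift is a smooth function of `delay`, an offset predicted by a network could in principle be optimized by gradient descent, which is one plausible reason to prefer a phase shift over integer index rolling.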

Specializing Large Language Models for educational domains is a key frontier in creating personalized learning tools. The central challenge is not data scarcity but its abundance: efficiently selecting a curated data subset from vast corpora to enhance specialized skills and foster generalization, without degrading existing abilities. Existing data selection paradigms, relying on superficial semantic similarity or model training dynamics, often lack a principled framework to identify data that promotes true cognitive growth. Our work proposes a paradigm shift from leveraging indirect proxies of learning value, such as semantic similarity and training dynamics, towards a framework that performs a direct, cognitive-level modeling of the learner's state. We introduce CASS, a novel framework that implements this cognitive approach through a clear pipeline, moving from an initial diagnosis to the ultimate goal of expanding the model's cognitive frontier. First, CASS diagnoses the LLM's cognitive frontier using Multidimensional Item Response Theory. Leveraging this diagnosis, it then employs Fisher Information to select a data subset situated at the LLM's cognitive frontier that offers maximum informational gain. Finally, the model is fine-tuned on this curated data using a structured, easy-to-hard curriculum to ensure effective learning. Experiments on our new multi-subject dataset show that models trained with CASS not only achieve superior accuracy in the target domain but also exhibit enhanced generalization. CASS provides a more efficient, effective, and theoretically grounded paradigm for building expert educational LLMs.

From Diagnosis to Generalization: A Cognitive Approach to Data Selection for Educational LLMs
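
To make the Fisher Information selection step in the abstract above concrete, here is a rough sketch under a deliberately simplified assumption: a unidimensional two-parameter logistic (2PL) IRT model instead of the multidimensional model the abstract names. Under 2PL, an item with discrimination a and difficulty b carries information a²·P·(1−P) at ability θ, which peaks when difficulty matches ability. All function names and the toy items are hypothetical.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_items(theta, items, k):
    """Pick the k most informative items, then order them easy-to-hard.

    items: list of (name, discrimination a, difficulty b) tuples.
    """
    ranked = sorted(items,
                    key=lambda it: fisher_information(theta, it[1], it[2]),
                    reverse=True)
    chosen = sorted(ranked[:k], key=lambda it: it[2])  # curriculum by difficulty
    return [name for name, _, _ in chosen]


# Toy pool: items near the learner's ability (theta = 0) are the most informative.
pool = [("too_easy", 1.0, -3.0), ("frontier_a", 1.0, 0.2),
        ("frontier_b", 1.0, -0.1), ("too_hard", 1.0, 3.0)]
subset = select_items(0.0, pool, 2)
```

Items far below or above the estimated ability contribute little information, which matches the intuition of selecting data at the model's cognitive frontier.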

Mixed-Integer Linear Programming (MILP) lies at the core of many real-world combinatorial optimization (CO) problems, traditionally solved by branch-and-bound (B&B).
A key driver of B&B solver efficiency is the variable selection heuristic that guides branching decisions.
Looking to move beyond static, hand-crafted heuristics, recent work has explored adapting traditional reinforcement learning (RL) algorithms to the B&B setting, aiming to learn branching strategies tailored to specific MILP distributions.
In parallel, RL agents have achieved remarkable success in board games, a very specific type of combinatorial problem, by leveraging environment simulators to plan via Monte Carlo Tree Search (MCTS).
Building on these developments, we introduce Plan-and-Branch-and-Bound (PlanB&B), a model-based reinforcement learning (MBRL) agent that leverages a learned internal model of the B&B dynamics to discover improved branching strategies.
Computational experiments empirically validate our approach, with our MBRL branching agent outperforming previous state-of-the-art RL methods across four standard MILP benchmarks.

Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization

Chinese Grammar Error Correction (CGEC) aims to identify and correct grammatical errors in Chinese sentences. Fine-tuning Large Language Models (LLMs) is currently a popular approach. However, we have observed a significant flaw: LLMs learn grammatical knowledge but often fail to explicitly use specific grammatical concepts to correct erroneous sentences, leading to multiple corrections without a clear indication of which is the most reliable. Humans possess an "intuitive thinking" mode, which allows them to quickly decide which correction is more reliable based on experience and intuition. To address this deficiency in LLMs, we propose the Expanding Intuitive Thinking Model (ExIT). ExIT extends the thinking process of LLMs for CGEC, providing them with a human-like rapid decision-making process. This enables LLMs to quickly select a more reliable correction from multiple alternatives based on experience and intuition. Unlike the LLM decoding process, which focuses only on the trustworthiness of local tokens, this is a global thinking process concerning the erroneous sentence and its correction. ExIT is a lightweight model that performs rapid computations without significantly increasing overhead. Our experimental results on CGEC datasets demonstrate that the proposed ExIT can substantially unleash the error correction potential of LLMs.

Intuitive Thinking: Expanding Large Language Models’ Thinking for Rapid Decision-Making on Candidate Corrections in Chinese Grammar Error Correction

Adverse weather conditions such as rain, fog, and snow significantly degrade LiDAR point cloud quality, causing substantial performance deterioration in detection models trained on clean data. To address this, we propose LTDNet, a novel point cloud quality improvement network that restores degraded LiDAR scans by learning an end-to-end mapping from corrupted to clean geometry. LTDNet leverages position encoding, spatial–frequency joint feature extraction, weather-aware refinement, and probabilistic pruning to effectively recover structural integrity while suppressing weather-induced noise. To facilitate standardized evaluation, we introduce IQA3D, a new benchmark comprising both synthetic and real-world sequences under adverse weather. This dual-design benchmark serves two complementary purposes: synthetic sequences provide pixel-wise correspondences between degraded and clean point clouds for quantitatively assessing restoration fidelity, while real-world sequences enable evaluation of the practical impact of improvement methods on downstream 3D object detection under authentic weather conditions. This makes IQA3D particularly suitable for jointly measuring both perceptual quality and task-level robustness of point cloud improvement models. Extensive experiments on IQA3D demonstrate that LTDNet significantly improves detection performance across various state-of-the-art 3D detectors and three tested weather conditions, making it a practical and effective solution for robust LiDAR-based detection.

Weather-Robust LiDAR Perception: Point Cloud Restoration from Adverse Weather

While diffusion model fine-tuning offers a powerful approach for customizing pre-trained models to generate specific objects, it frequently suffers from overfitting when training samples are limited, compromising both generalization capability and output diversity. This paper tackles the challenging yet most impactful task of adapting a diffusion model using just a single concept image, as single-image customization holds the greatest practical potential. We introduce *T-LoRA*, a **T**imestep-Dependent **Lo**w-**R**ank **A**daptation framework specifically designed for diffusion model personalization. In our work we show that higher diffusion timesteps are more prone to overfitting than lower ones, necessitating a timestep-sensitive fine-tuning strategy. *T-LoRA* incorporates two key innovations: (1) a dynamic fine-tuning strategy that adjusts rank-constrained updates based on diffusion timesteps, and (2) a weight parametrization technique that ensures independence between adapter components through orthogonal initialization. Extensive experiments show that *T-LoRA* and its individual components outperform standard LoRA and other diffusion model personalization techniques. They achieve a superior balance between concept fidelity and text alignment, highlighting the potential of *T-LoRA* in data-limited and resource-constrained scenarios.

T-LoRA: Single Image Diffusion Model Customization Without Overfitting

Precise environmental perception is critical for the reliability of autonomous driving systems. While collaborative perception mitigates the limitations of single-agent perception through information sharing, it encounters a fundamental communication-performance trade-off. Existing communication-efficient approaches typically assume MB-level data transmission per collaboration, which may fail due to practical network constraints. To address these issues, we propose InfoCom, an information-aware framework establishing the pioneering theoretical foundation for communication-efficient collaborative perception via extended Information Bottleneck principles. Departing from mainstream feature manipulation, InfoCom introduces a novel information purification paradigm that theoretically optimizes the extraction of minimal sufficient task-critical information under Information Bottleneck constraints. Its core innovations include: i) An Information-Aware Encoding condensing features into minimal messages while preserving perception-relevant information; ii) A Sparse Mask Generation identifying spatial cues with negligible communication cost; and iii) A Multi-Scale Decoding that progressively recovers perceptual information through mask-guided mechanisms rather than simple feature reconstruction. Comprehensive experiments across multiple datasets demonstrate that InfoCom achieves near-lossless perception while reducing communication overhead from megabyte to kilobyte-scale, representing 440-fold and 90-fold reductions per agent compared to Where2comm and ERMVP, respectively. The code will be open-sourced upon acceptance.

InfoCom: Kilobyte-Scale Communication-Efficient Collaborative Perception with Information Bottleneck

Scenes with water surfaces present a significant challenge for Gaussian Splatting due to the simultaneous presence of refraction and reflection, as well as the difficulty of accurately estimating the geometry of transparent water surfaces. To address this, we propose a novel framework for reconstructing scenes involving both reflection and refraction caused by water surfaces. The water surface is modeled as a trainable plane, and 2D Gaussian ray tracing is applied to account for refraction through the water. We extend 2D Gaussian Splatting by introducing a soft mask parameter and a dual set of Gaussian primitives, which handle both reflected and refracted effects. Our method achieves state-of-the-art performance on newly constructed water surface datasets, including both synthetic and real scenes, and significantly outperforms prior approaches in water-interacting regions. Furthermore, we demonstrate the editability of our model by manipulating the index of refraction to suppress or modify refractive effects, enabling scene transformations into different liquids.

Through the Water: Refractive Gaussian Splatting for Water Surface Scenes
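
The refraction step that ray tracing through a planar water surface depends on can be sketched with the vector form of Snell's law. This is a generic textbook formula, not the paper's code; the function name `refract` and the convention that the normal points toward the incoming ray are assumptions made for this illustration, and 1.33 is the commonly quoted refractive index of water.

```python
import math


def refract(d, n, eta):
    """Refract unit direction d at a surface with unit normal n.

    n points toward the side the ray arrives from; eta = n1 / n2 is the ratio
    of refractive indices (incident medium over transmitting medium).
    Returns the refracted unit direction, or None on total internal reflection.
    """
    cos_i = -sum(di * ni for di, ni in zip(d, n))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell: sin_t = eta * sin_i
    if sin2_t > 1.0:
        return None  # no transmitted ray exists
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * di + (eta * cos_i - cos_t) * ni for di, ni in zip(d, n))


# A ray hitting a horizontal water plane at 45 degrees from the air side.
s = math.sin(math.pi / 4)
incoming = (s, 0.0, -s)
up = (0.0, 0.0, 1.0)
into_water = refract(incoming, up, 1.0 / 1.33)
```

Setting `eta` to 1.0 returns the incoming direction unchanged, which corresponds to the abstract's remark that editing the index of refraction can suppress or modify refractive effects.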

Identification of fine-grained embryo developmental stages during In Vitro Fertilization (IVF) is crucial for assessing embryo viability. Although recent deep learning methods have achieved promising accuracy, existing approaches based on discriminative models fail to utilize the distributional prior of embryonic development. Moreover, they suffer from incomplete embryonic representation due to their reliance on single-focal information, thereby making them susceptible to feature ambiguity caused by cell occlusions. To address these limitations, we propose EmbryoDiff, a two-stage diffusion-based framework that utilizes sequence features as condition signals for accurate stage recognition. Specifically, in the first stage, a frame-level encoder is trained and fixed to extract robust multi-focal visual features for training the diffusion model. In the second stage, we introduce a Multi-Focal Feature Fusion strategy that integrates information across focal planes to build a morphological representation with 3D contextual awareness, mitigating ambiguity caused by cell occlusions. Based on the fused features, we further extract complementary semantic and boundary condition features and design a Hybrid Semantic-Boundary Condition Block to effectively inject them into the denoising process for accurate stage classification. Extensive experiments on two benchmark datasets demonstrate that our method achieves state-of-the-art performance. Notably, our model attains optimal average test performance with only one denoising step, achieving 82.8% and 81.3% accuracy on the two datasets, respectively.

EmbryoDiff: A Conditional Diffusion Framework with Multi-Focal Feature Fusion for Fine-Grained Embryo Developmental Stage Recognition

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, but often exhibit overconfidence and generate plausible yet incorrect answers. This overconfidence, especially in models that have undergone Reinforcement Learning from Human Feedback (RLHF), poses significant challenges for reliable uncertainty estimation and safe deployment.
In this paper, we propose EAGLE (Expectation of AGgregated internaL bEief), a novel self-evaluation-based calibration method that leverages the internal hidden states of LLMs to derive more accurate confidence scores.
Instead of relying on the model's final output, our approach extracts internal beliefs from multiple intermediate layers during self-evaluation. By aggregating these layer-wise beliefs and calculating the expectation over the resulting confidence score distribution, EAGLE produces a refined confidence score that more faithfully reflects the model's internal certainty. Extensive experiments on diverse datasets and LLMs demonstrate that EAGLE significantly improves calibration performance over existing baselines.
We also provide an in-depth analysis of EAGLE, including a layer-wise examination of uncertainty patterns, a study of the impact of self-evaluation prompts, and an analysis of the effect of self-evaluation score range.
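
The aggregation and expectation described in the abstract above can be sketched in a few lines. The abstract does not say how layer-wise beliefs are combined, so this example assumes a plain average of per-layer probability distributions over a shared set of self-evaluation scores; how those distributions are read out of hidden states is not shown. The names `aggregate_beliefs` and `expected_confidence` and the toy numbers are hypothetical.

```python
def aggregate_beliefs(layer_dists):
    """Average per-layer probability distributions over the same score bins.

    Assumption: simple mean across layers; the actual method may weight layers.
    """
    n_layers = len(layer_dists)
    n_bins = len(layer_dists[0])
    return [sum(dist[i] for dist in layer_dists) / n_layers
            for i in range(n_bins)]


def expected_confidence(layer_dists, scores):
    """Expectation of the score under the aggregated belief, scaled to [0, 1]."""
    agg = aggregate_beliefs(layer_dists)
    expectation = sum(p * s for p, s in zip(agg, scores))
    return expectation / max(scores)


# Two layers disagree about a self-evaluation score on a 0..2 scale.
beliefs = [[0.0, 0.0, 1.0],   # one layer puts all mass on score 2
           [0.0, 1.0, 0.0]]   # another puts all mass on score 1
confidence = expected_confidence(beliefs, [0, 1, 2])
```

Taking an expectation rather than the single most likely score keeps disagreement between layers visible in the final confidence value.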
