Diffusion-based talking head models generate high-quality, photorealistic videos but suffer from slow inference, which limits practical applications. Existing acceleration methods for general diffusion models fail to exploit the temporal and spatial redundancies unique to talking head generation. In this paper, we propose a task-specific framework that addresses these inefficiencies through two key innovations. First, we introduce Lightning-fast Caching-based Parallel denoising prediction (LightningCP), which caches static features to bypass most model layers at inference time. We also enable parallel prediction using cached features and estimated noisy latents as inputs, efficiently bypassing sequential sampling. Second, we propose Decoupled Foreground Attention (DFA) to further accelerate attention computations, exploiting the spatial decoupling in talking head videos to restrict attention to dynamic foreground regions. Additionally, we remove reference features in certain layers for extra speedup. Extensive experiments demonstrate that our framework significantly improves inference speed while preserving video quality.
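To make the idea of restricting attention to dynamic foreground regions concrete, the following is a minimal pure-Python sketch and not the paper's DFA implementation. It assumes a hypothetical boolean `fg_mask` marking foreground tokens and a hypothetical `cached_out` holding previously computed outputs for static background tokens; the function names are illustrative only.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Plain scaled dot-product attention over lists of vectors.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def foreground_attention(tokens, fg_mask, cached_out):
    # Assumption: background tokens are static, so their outputs are
    # reused from cached_out; only foreground tokens attend to each other.
    fg_idx = [i for i, is_fg in enumerate(fg_mask) if is_fg]
    fg_tokens = [tokens[i] for i in fg_idx]
    fg_out = attention(fg_tokens, fg_tokens, fg_tokens) if fg_tokens else []
    out = [list(row) for row in cached_out]  # copy so the cache is not mutated
    for i, row in zip(fg_idx, fg_out):
        out[i] = row
    return out

def score_pairs(n_tokens):
    # Number of query-key score computations for self-attention.
    return n_tokens * n_tokens
```

With 4 tokens of which 2 are foreground, this sketch computes `score_pairs(2)` = 4 query-key scores instead of `score_pairs(4)` = 16, which illustrates why shrinking the attended region can reduce attention cost quadratically.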
