Temporal Action Detection (TAD) aims to identify specific actions in long, untrimmed videos by determining their start times, end times, and categories, yet existing models degrade under out-of-distribution scenarios because they rely on unrealistic i.i.d. assumptions. While domain generalization (DG) offers a promising remedy, image-based DG methods fail to address the challenges unique to video-based TAD, including spatiotemporal complexity and the large variations in action-instance scales and densities across domains. To bridge this gap, we propose the first DG framework tailored for TAD. First, Scene-Aware Video Segmentation partitions videos according to semantic similarity, mitigating cross-domain discrepancies in action-instance density and scale. Second, Temporal-Aware Normalization Perturbation generates diverse video features while preserving temporal integrity. We also establish the first DG-TAD benchmark, evaluating 11 state-of-the-art DG methods across four datasets. Experiments demonstrate that our framework consistently outperforms existing approaches, achieving superior generalization on unseen domains. Both proposed modules are architecture-agnostic, offering plug-and-play compatibility with broader video understanding tasks.
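The abstract does not spell out how Temporal-Aware Normalization Perturbation works, but the stated idea of "diverse features with temporal integrity preserved" resembles normalization-statistics perturbation methods from the DG literature. The sketch below is a hypothetical illustration, not the authors' method: the function name, the `(C, T)` feature layout, and the `alpha` noise scale are all assumptions. It perturbs per-channel statistics computed across time, so individual frames are never reordered or independently distorted.

```python
import numpy as np

def temporal_norm_perturb(feats, alpha=0.1, eps=1e-6, rng=None):
    """Hypothetical sketch: perturb per-channel normalization statistics
    of clip features `feats` with shape (C, T), leaving the temporal
    ordering of frames intact. NOT the paper's actual module."""
    rng = rng or np.random.default_rng()
    # Channel-wise statistics pooled over the temporal axis.
    mu = feats.mean(axis=1, keepdims=True)          # (C, 1)
    sigma = feats.std(axis=1, keepdims=True) + eps  # (C, 1)
    # Perturb the statistics themselves (one noise sample per channel),
    # rather than per-frame values, so every frame in a channel is
    # transformed by the same affine map.
    mu_new = mu * (1.0 + alpha * rng.standard_normal(mu.shape))
    sigma_new = sigma * (1.0 + alpha * rng.standard_normal(sigma.shape))
    # Re-standardize, then re-style with the perturbed statistics.
    return (feats - mu) / sigma * sigma_new + mu_new
```

Because each channel undergoes a single affine transform (with a positive scale for small `alpha`), the relative ordering of feature values over time is unchanged, which is one concrete way to read "preserving temporal integrity" while still diversifying feature statistics.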
