The task of video-to-video human motion editing aims to transfer motion from a source video to a reference video while preserving the background dynamics and the original protagonist's appearance. Our analysis identifies critical limitations in existing models, which fail to capture the full complexity of human motion, particularly regarding 1) location changes, 2) orientation variations, and 3) complicated non-upright poses. To address these challenges, we propose a framework that selectively "copies and pastes" 2D and 3D features across spatio-temporal dimensions into a shared representation space for motion guidance. This is achieved through: 1) a mutual distillation mechanism that enhances the robustness and capability of the individual encoders, and 2) a selective fusion module that adaptively weights and combines complementary information from the spatio-temporal representations. To push the limits of motion editing algorithms with challenging scenarios, we introduce an evaluation dataset comprising real-world video clips from artistic gymnastics and figure skating competitions. These sports disciplines naturally encompass the three aspects of motion complexity listed above. Experiments demonstrate that our approach significantly outperforms existing methods, particularly in handling intricate human motions.
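As a rough illustration of the adaptive weighting idea behind a selective fusion module, the sketch below gates between a 2D and a 3D feature vector with a learned per-dimension weight. This is a minimal assumption-laden toy (the function name, the sigmoid gate, and the linear gate predictor are all hypothetical simplifications, not the paper's actual architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selective_fusion(f2d, f3d, W, b):
    """Toy gated fusion of 2D and 3D motion features.

    A per-dimension gate in [0, 1], predicted from both feature streams,
    decides how much each representation contributes to the shared space.
    (Illustrative only; the real module would be learned end-to-end.)
    """
    gate = sigmoid(np.concatenate([f2d, f3d]) @ W + b)  # shape (d,)
    # Convex combination: each fused value lies between the two inputs.
    return gate * f2d + (1.0 - gate) * f3d

# Example with random features and gate parameters.
rng = np.random.default_rng(0)
d = 8
f2d, f3d = rng.normal(size=d), rng.normal(size=d)
W, b = rng.normal(size=(2 * d, d)), np.zeros(d)
fused = selective_fusion(f2d, f3d, W, b)
```

Because the gate is a convex weight, the fused feature never leaves the element-wise range spanned by the two input streams, which makes the weighting interpretable as a soft per-dimension selection.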