Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos containing moments relevant to a given text query. This task is extremely challenging, as untrimmed videos often include numerous actions and objects unrelated to the query. Moreover, existing methods typically struggle with fine-grained action-object modeling, which limits their retrieval performance. To tackle this challenge, we introduce Action-and-object Aware Alignment for Partially Relevant Video Retrieval (A$^3$PRVR), a dual-branch framework designed to enhance retrieval by improving the modeling of action-object relationships. Specifically, we propose a Query-specific Deformable Temporal Attention (Q-DTA) module to effectively capture action-relevant object information in video features while filtering out irrelevant content. Additionally, we propose an action-and-object aware alignment module to enable fine-grained textual understanding and video-text alignment. It uses action- and object-aware contrastive losses to enhance the model's sensitivity to action-object distinctions in the text query. Compared to state-of-the-art methods, A$^3$PRVR achieves an average relative gain of 6.5% in SumR across the Charades-STA, ActivityNet Captions, and TVR datasets.
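To make the alignment idea concrete, the sketch below shows a generic symmetric InfoNCE-style contrastive loss between text and video embeddings, the standard formulation such video-text alignment losses build on. This is an illustrative assumption, not the paper's exact action- and object-aware losses; the function name, temperature value, and batch construction are all hypothetical.

```python
import numpy as np

def info_nce_loss(text_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss.

    Row i of text_emb and row i of video_emb are assumed to be a
    matched (positive) pair; all other rows in the batch serve as
    negatives. A sketch of the generic video-text alignment objective,
    not the paper's specific action-/object-aware variants.
    """
    # L2-normalize so dot products are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sim = (t @ v.T) / temperature           # (B, B) similarity logits
    labels = np.arange(sim.shape[0])        # positives lie on the diagonal

    def cross_entropy(logits):
        # Numerically stable log-softmax, then pick the diagonal targets.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the text->video and video->text retrieval directions.
    return 0.5 * (cross_entropy(sim) + cross_entropy(sim.T))
```

In this formulation, lowering the loss pulls each query embedding toward its matched video embedding and pushes it away from the other videos in the batch; the paper's action- and object-aware variants additionally condition this contrast on action and object cues parsed from the query.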