Video captioning aims to generate comprehensive and coherent descriptions of video content, contributing to the advancement of both video understanding and video generation. However, existing video captioning methods often suffer from motion-detail imbalance: models tend to overemphasize one aspect while neglecting the other. To address this issue, we propose solutions from two aspects. 1) Data: we construct the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline of Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). Compared with previous video captioning datasets, HMD-270K features longer captions with more balanced and comprehensive motion-detail descriptions, directly mitigating the motion-detail imbalance problem. 2) Optimization: we introduce the Caption Set Equivalence Reward (CSER), built on GRPO, which employs a subset-to-set matching and bidirectional validation strategy. Compared with previous video captioning rewards, CSER optimizes the completeness and correctness of captions at a finer granularity. Post-training on HMD-270K with CSER yields OwlCap, a video captioning multi-modal large language model (MLLM) with balanced motion-detail capabilities. Experimental results demonstrate that OwlCap achieves significant improvements over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1 score). Experiments on the downstream text-to-video (T2V) task further confirm OwlCap's superior video captioning capability. The HMD-270K dataset and the OwlCap model will be publicly released to facilitate advancements in the video captioning research community.
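The abstract does not spell out the CSER formula, but the subset-to-set matching and bidirectional validation it mentions can be read as a precision/recall-style score over atomic caption units. The sketch below is a minimal, hypothetical illustration of that reading, not the paper's implementation: the `caption_set_reward` function and the `matches` judge are assumptions introduced here for clarity. A scalar produced this way could serve as the per-sample reward inside a GRPO-style optimization loop.

```python
from typing import Callable, List


def caption_set_reward(
    generated_units: List[str],
    reference_units: List[str],
    matches: Callable[[str, List[str]], bool],
) -> float:
    """Illustrative precision/recall-style reward over caption units.

    `matches(unit, unit_set)` stands in for a judge (e.g., an NLI- or
    LLM-based checker) that returns True if `unit` is supported by any
    unit in `unit_set`. This is an assumption for illustration; the
    actual CSER formulation may differ.
    """
    if not generated_units or not reference_units:
        return 0.0
    # Correctness: fraction of generated units supported by the reference set.
    correct = sum(matches(u, reference_units) for u in generated_units)
    precision = correct / len(generated_units)
    # Completeness: fraction of reference units covered by the generated caption.
    covered = sum(matches(u, generated_units) for u in reference_units)
    recall = covered / len(reference_units)
    # Bidirectional validation: combine both directions (F1-style harmonic mean).
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Trivial exact-match judge as a placeholder for a learned checker.
    judge = lambda unit, unit_set: unit in unit_set
    gen = ["a dog runs across the lawn", "the dog is brown"]
    ref = ["a dog runs across the lawn", "a child throws a ball"]
    print(caption_set_reward(gen, ref, judge))  # 0.5
```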