AAAI 2026

January 22, 2026

Singapore, Singapore


Video captioning aims to generate comprehensive and coherent descriptions of video content, contributing to the advancement of both video understanding and video generation. However, existing video captioning methods often suffer from motion-detail imbalance: models tend to overemphasize one aspect while neglecting the other. To address this issue, we propose solutions from two aspects. 1) Dataset: we construct the Harmonizing Motion-Detail 270K (HMD-270K) dataset through a two-stage pipeline of Motion-Detail Fusion (MDF) and Fine-Grained Examination (FGE). Compared with previous video captioning datasets, HMD-270K features longer captions with more balanced and comprehensive motion-detail descriptions, directly mitigating the motion-detail imbalance problem. 2) Optimization: we introduce the Caption Set Equivalence Reward (CSER), built on GRPO, which employs a subset-to-set matching and bidirectional validation strategy. Compared with previous video captioning rewards, CSER takes a more fine-grained approach to optimizing the completeness and correctness of captions. Building on HMD-270K and CSER post-training, we develop OwlCap, a powerful video captioning multi-modal large language model (MLLM) with motion-detail balance capabilities. Experimental results show that OwlCap achieves significant improvements over baseline models on two benchmarks: the detail-focused VDC (+4.2 Acc) and the motion-focused DREAM-1K (+4.6 F1 score). Experiments on the downstream text-to-video (T2V) task further confirm OwlCap's superior video captioning capability. The HMD-270K dataset and the OwlCap model will be publicly released to advance the video captioning research community.
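The abstract does not give CSER's exact formulation, but the idea of subset-to-set matching with bidirectional validation can be illustrated with a minimal, assumption-laden sketch: each caption is decomposed into atomic statements, and the reward combines correctness (generated statements supported by the reference set) with completeness (reference statements covered by the generated set). The function names and the token-overlap matcher below are hypothetical stand-ins, not the paper's implementation; in practice an entailment model would likely replace the Jaccard heuristic.

```python
def tokenize(s):
    return set(s.lower().split())

def entails(a, b, threshold=0.6):
    # Hypothetical matcher: treat statement `a` as supported by `b`
    # when their token overlap (Jaccard similarity) exceeds a threshold.
    ta, tb = tokenize(a), tokenize(b)
    return len(ta & tb) / len(ta | tb) >= threshold

def coverage(subset, full_set):
    # Fraction of statements in `subset` matched by some statement in `full_set`.
    if not subset:
        return 0.0
    hits = sum(any(entails(s, t) for t in full_set) for s in subset)
    return hits / len(subset)

def cser_reward(generated, reference):
    # Bidirectional validation: correctness checks generated statements
    # against the reference set; completeness checks reference statements
    # against the generated set. Combine the two with a harmonic mean.
    correctness = coverage(generated, reference)
    completeness = coverage(reference, generated)
    if correctness + completeness == 0:
        return 0.0
    return 2 * correctness * completeness / (correctness + completeness)
```

Under GRPO, a scalar reward of this shape would be computed per sampled caption and used to rank candidates within a group; the sketch only shows how set-level matching can penalize both hallucinated and missing content at once.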

Downloads

Paper
