AAAI 2026

January 25, 2026

Singapore, Singapore


Massive multimodal datasets are fundamental to the success of large video-language models. However, existing datasets typically focus on textual descriptions of visual content, treating audio, and music in particular, as weakly related side information. This overlooks the inherent semantic correlation between visual narratives and musical scores, limiting the development of models for fine-grained cross-modal understanding and generation. To address this gap, we introduce VMChill, a large-scale, fine-grained multimodal video dataset. We use trailers as our data source because they are professionally edited to create strong synergy between visual pacing, scene transitions, and background music for narrative and emotional impact. Our dataset comprises over 20 million video clips derived from more than 27.1k hours of high-resolution trailer videos. To annotate this data, we propose a systematic multimodal captioning framework. The framework first employs specialized unimodal models to extract descriptive features from multiple perspectives, including visual content, motion dynamics, and musical attributes (e.g., genre, instruments, mood). A large language model (LLM) then adaptively fuses these diverse descriptions into a single, coherent, and rich multimodal caption. This process yields VMChill-2M, a high-quality subset of 2 million clips with detailed multimodal annotations, and VMChill-Test, a manually refined test set for evaluation. We conduct extensive experiments on downstream tasks, including video understanding and generation, to establish benchmarks and demonstrate the dataset's quality. The results validate that VMChill effectively enhances model performance, highlighting its potential to facilitate future research in fine-grained multimodal learning. We will release the dataset, annotation codebase, and processing pipelines to support community research.
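The captioning framework described above — per-modality descriptions fused by an LLM into one caption — can be sketched roughly as follows. This is a minimal illustration, not the authors' released pipeline: the data class, the `fuse_captions` helper, and the concatenation fallback are all hypothetical, and a real implementation would call the unimodal models and an actual LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class UnimodalCaptions:
    """Hypothetical container for per-modality descriptions of one clip."""
    visual: str   # e.g., from an image/video captioner
    motion: str   # e.g., from a motion/action recognition model
    music: str    # e.g., genre, instruments, mood from a music tagger

def build_fusion_prompt(caps: UnimodalCaptions) -> str:
    """Assemble a prompt asking an LLM to merge the unimodal descriptions."""
    return (
        "Fuse the following descriptions of one video clip into a single, "
        "coherent multimodal caption.\n"
        f"Visual: {caps.visual}\n"
        f"Motion: {caps.motion}\n"
        f"Music: {caps.music}\n"
    )

def fuse_captions(
    caps: UnimodalCaptions,
    llm: Optional[Callable[[str], str]] = None,
) -> str:
    """Return a fused caption; fall back to plain concatenation without an LLM."""
    if llm is None:
        return f"{caps.visual} {caps.motion} The soundtrack is {caps.music}."
    return llm(build_fusion_prompt(caps))
```

In the paper's actual framework the fusion is adaptive (the LLM decides how to weigh and merge the modalities); the concatenation branch here only stands in so the sketch runs without model weights.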

