The data scaling law has significantly enhanced the performance of large multi-modal models (LMMs) across various downstream tasks. However, in the domain of perceptual video quality assessment (VQA), the potential of data scaling remains largely untapped due to the scarcity of labeled resources and the insufficient scale of existing datasets. To address this, we propose OmniVQA, a framework designed to efficiently build high-quality, machine-dominated synthetic multi-modal instruction databases (MIDBs) for VQA. Using this framework, we scale up to create OmniVQA-Chat-400K, currently the largest dataset in the VQA field. Our focus is on the technical and aesthetic quality dimensions, with abundant in-context instruction data that provides fine-grained VQA knowledge. Additionally, we build the OmniVQA-MOS-20K dataset to strengthen the model's quantitative quality-rating capabilities. We then introduce a complementary training strategy that effectively leverages knowledge from datasets built for different tasks. Furthermore, we propose the OmniVQA-FG (fine-grained) Benchmark to evaluate the fine-grained performance of models. Our results demonstrate that our models achieve state-of-the-art performance.
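For readers unfamiliar with instruction-tuning corpora for VQA, the sketch below illustrates one plausible shape such data could take: a conversational instruction record (as in OmniVQA-Chat-400K) alongside a MOS-labeled record (as in OmniVQA-MOS-20K). The field names, role labels, and score scale are assumptions for illustration only and are not specified by the abstract above.

```python
# Hypothetical illustration only: the exact schemas of OmniVQA-Chat-400K and
# OmniVQA-MOS-20K are not given here. This sketch assumes a common
# instruction-tuning layout (video reference + conversation turns) and a
# separate MOS record (video reference + mean opinion score).

from dataclasses import dataclass, field
from typing import List, Dict


@dataclass
class InstructionRecord:
    """One assumed entry of a multi-modal instruction database (MIDB)."""
    video: str                                        # path or identifier of the video clip
    conversations: List[Dict[str, str]] = field(default_factory=list)
    # e.g. [{"role": "user", "content": "Describe the technical quality ..."},
    #       {"role": "assistant", "content": "The clip shows mild motion blur ..."}]


@dataclass
class MOSRecord:
    """One assumed entry of a MOS-labeled set for quantitative quality rating."""
    video: str
    mos: float                                        # mean opinion score, assumed 1-5 scale


# Example usage with placeholder values:
chat_item = InstructionRecord(
    video="clips/0001.mp4",
    conversations=[
        {"role": "user", "content": "Assess the aesthetic quality of this video."},
        {"role": "assistant", "content": "Composition is balanced, but lighting is flat."},
    ],
)
mos_item = MOSRecord(video="clips/0001.mp4", mos=3.8)
```

A layout like this would also make the complementary training strategy easy to picture: conversational records supervise fine-grained quality descriptions, while MOS records supervise numeric quality rating, and the two can be mixed during instruction tuning.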
