The ability to critique is vital for models to self-improve and to serve as reliable AI assistants. While extensively studied in language-only settings, critique of Large Multimodal Models (LMMs) remains underexplored despite their growing capabilities in tasks such as captioning and visual reasoning. In this work, we introduce MM-Critic, a holistic benchmark for evaluating the critique ability of LMMs across three dimensions: basic, correlation, and comparison. Covering 8 task types and over 500 tasks, MM-Critic collects responses from LMMs of various model sizes. To enhance evaluation reliability, we design expert-informed scoring rubrics that guide GPT-4o in annotating responses and generating reference critiques, which serve as anchors for trustworthy judgments. Extensive experiments validate the effectiveness of MM-Critic and provide a comprehensive assessment of leading LMMs' critique capabilities. Further analysis reveals key insights, including the correlation between response quality and critique ability, and the varying difficulty of critique across evaluation dimensions.
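To make the rubric-anchored judging step concrete, the sketch below shows one way such an evaluation call could be implemented with the OpenAI Python SDK. The prompt wording, rubric placeholder, and score-parsing logic are illustrative assumptions for a text-only judge call, not the benchmark's released pipeline.

```python
# Minimal sketch of rubric-anchored critique scoring, loosely following the
# setup described in the abstract. The rubric text, prompt wording, and
# score-extraction logic are illustrative assumptions, not MM-Critic's
# actual protocol. Image inputs are omitted for brevity.
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a model-written critique of a multimodal response.

Scoring rubric (1-5):
{rubric}

Reference critique (anchor for your judgment):
{reference_critique}

Task: {task}
Model response being critiqued: {response}
Critique under evaluation: {critique}

Return a single integer score from 1 to 5 on the last line as "Score: <n>"."""


def score_critique(task: str, response: str, critique: str,
                   reference_critique: str, rubric: str) -> int:
    """Ask GPT-4o to score one critique against an expert-informed rubric,
    using a reference critique as the anchor."""
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            rubric=rubric,
            reference_critique=reference_critique,
            task=task,
            response=response,
            critique=critique,
        )}],
        temperature=0,
    )
    text = completion.choices[0].message.content
    match = re.search(r"Score:\s*(\d)", text)
    return int(match.group(1)) if match else -1  # -1 signals an unparsable judgment
```

Pinning temperature to 0 and extracting a single integer keeps the judge's output deterministic and easy to aggregate across the benchmark's tasks; the reference critique in the prompt is what anchors the score, as described in the abstract.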