Current Multimodal Chain-of-Thought (MCoT) methods suffer from low-quality multimodal reasoning: they overthink simple queries and use visual information inefficiently, wasting substantial computation. In this paper, we show that Multimodal Large Language Models (MLLMs) possess inherent, underutilized capabilities to distinguish simple from difficult queries and to enhance task-related visual information. Based on this insight, we propose Self-Driven Refined Multimodal CoT (SDR-MCoT), a training-free framework that mitigates these issues through two self-driven modules. First, a selective thinking module uses entropy-based confidence estimation to decide whether a query requires detailed reasoning, preventing overthinking on simple questions. Second, a step-wise visual enhancement module strengthens attention to relevant visual regions at each reasoning step without inserting additional tokens, achieving fine-grained visual grounding and enhancement with minimal overhead. Moreover, SDR-MCoT can be seamlessly integrated into various MLLMs, offering a practical route to better multimodal reasoning. Comprehensive experiments on eight benchmarks spanning diverse domains (multimodal reasoning, visual understanding, hallucination, and mathematical reasoning) demonstrate that SDR-MCoT consistently outperforms existing MCoT methods on four different base models with reduced overhead. For instance, on Qwen2-VL-7B, our method improves average accuracy by over 6% while reducing token consumption by approximately 60% compared to zero-shot CoT.
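The selective thinking idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the threshold value, function names, and the use of average Shannon entropy over the model's direct-answer token distributions are all illustrative assumptions.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def needs_detailed_reasoning(token_distributions, threshold=1.0):
    """Hypothetical selective-thinking gate: flag a query as difficult when
    the model's average uncertainty over its direct-answer tokens exceeds
    a threshold. The threshold of 1.0 nat is an arbitrary illustration."""
    entropies = [token_entropy(p) for p in token_distributions]
    return sum(entropies) / len(entropies) > threshold

# A peaked distribution -> low entropy -> answer directly, skip detailed CoT.
easy = [[0.97, 0.01, 0.01, 0.01]]
# A near-uniform distribution -> high entropy -> trigger full reasoning.
hard = [[0.25, 0.25, 0.25, 0.25]]
print(needs_detailed_reasoning(easy))  # False
print(needs_detailed_reasoning(hard))  # True
```

In this toy setup, confident (low-entropy) predictions bypass chain-of-thought generation entirely, which is one plausible way the reported token savings on simple queries could arise.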
