Multimodal Large Language Models (MLLMs) have shown strong performance on vision-language tasks. However, existing multimodal reasoning models often produce excessive reasoning steps, leading to high computational cost and inefficiency. In this paper, we propose the Multimodal Adaptive Reasoning Model (MARS), which adaptively adjusts its reasoning strategy to question difficulty. Specifically, MARS adopts a three-stage training framework built on our constructed training dataset (MART): 1) CoT Masking Learning, which strengthens logical coherence by predicting masked reasoning steps; 2) Adaptive Reasoning Instruction Learning, which trains the model to skip or keep reasoning steps according to difficulty level; and 3) CoT Lightweight Reinforcement Learning, a GRPO algorithm grounded in the Information Bottleneck principle that shortens the CoT while preserving performance and generalizability. Results on both in-domain and out-of-domain datasets show that MARS substantially reduces CoT length (a 90.2% decrease) while improving accuracy by 0.54%, outperforming existing SOTA open-source and proprietary MLLMs.
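To make the third stage concrete, the sketch below shows how a GRPO-style group-relative advantage could be combined with a chain-of-thought length penalty. This is a minimal illustration only: the `length_coef` penalty and its value are hypothetical and do not reflect the paper's actual Information Bottleneck-based reward design.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, cot_lengths, length_coef=0.001):
    # Shape each sampled response's reward with a penalty proportional
    # to its CoT token length (hypothetical coefficient, for illustration).
    shaped = [r - length_coef * n for r, n in zip(rewards, cot_lengths)]
    mu, sigma = mean(shaped), pstdev(shaped)
    # GRPO-style normalization: standardize shaped rewards within the
    # group of responses sampled for the same question.
    return [(s - mu) / (sigma + 1e-8) for s in shaped]

# Four sampled responses: two correct (reward 1.0), two wrong (reward 0.0),
# with varying CoT lengths. The short correct answer earns the top advantage.
advs = grpo_advantages([1.0, 1.0, 0.0, 0.0], [120, 600, 80, 400])
```

Under this shaping, a correct answer with a shorter CoT is preferred over an equally correct but longer one, which is one simple way a policy can be pushed toward lightweight reasoning.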