Multimodal large language models (MLLMs) have demonstrated strong performance across diverse multimodal tasks. However, their application to emotion recognition in natural images remains under-explored: MLLMs struggle with ambiguous emotional expressions and implicit affective cues, a capability that is crucial for affective understanding but largely overlooked. To address these challenges, we propose MERMAID, a novel multi-agent framework that integrates an emotion-guided visual augmentation module and a multi-perspective self-reflection module. These modules enable agents to interact across modalities and reinforce subtle emotional semantics, improving emotion recognition while allowing the agents to operate autonomously. Extensive experiments show that MERMAID outperforms existing methods, delivering absolute accuracy gains of 8.70%-27.90% across diverse benchmarks and exhibiting greater robustness in emotionally diverse scenarios.
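The abstract describes the framework only at a high level, so the sketch below is a hypothetical illustration of how an emotion-guided augmentation step and a multi-perspective self-reflection loop could be orchestrated around a generic MLLM callable. The function names (emotion_guided_augmentation, recognize_with_reflection), the MLLMCall interface, and the prompt wording are assumptions for illustration, not MERMAID's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: an MLLM takes an image reference and a text prompt
# and returns a free-form text response. Any real model client could be
# wrapped to match this signature.
MLLMCall = Callable[[str, str], str]

@dataclass
class EmotionVerdict:
    label: str
    rationale: str

def emotion_guided_augmentation(mllm: MLLMCall, image: str) -> List[str]:
    """Ask the model to surface subtle affective cues (faces, posture,
    scene context) that a single-pass classifier might miss."""
    cues = mllm(image, "List visual cues in this image that hint at the "
                       "subject's emotional state, one per line.")
    return [c.strip() for c in cues.splitlines() if c.strip()]

def recognize_with_reflection(mllm: MLLMCall, image: str,
                              perspectives: List[str],
                              rounds: int = 2) -> EmotionVerdict:
    """Multi-perspective self-reflection: several agent prompts each propose
    a label, a reviewer pass reconciles them, and the loop repeats."""
    cues = emotion_guided_augmentation(mllm, image)
    context = "Observed cues: " + "; ".join(cues)
    verdict = EmotionVerdict(label="neutral", rationale="initial guess")
    for _ in range(rounds):
        proposals = [
            mllm(image, f"{context}\nFrom the perspective of {p}, what emotion "
                        "does this image convey? Answer as 'label: reason'.")
            for p in perspectives
        ]
        review = mllm(image, "Reconcile these proposals into a single emotion "
                             "label and rationale, as 'label: reason':\n"
                             + "\n".join(proposals))
        label, _, rationale = review.partition(":")
        verdict = EmotionVerdict(label=label.strip() or verdict.label,
                                 rationale=rationale.strip())
    return verdict

if __name__ == "__main__":
    # Stub model for a dry run; replace with a real MLLM client in practice.
    stub = lambda image, prompt: "joy: bright lighting and a smiling subject"
    print(recognize_with_reflection(stub, "photo.jpg",
                                    ["facial expression", "scene context"]))
```

The stub simply echoes a fixed answer, which is enough to exercise the control flow; the design point illustrated is that augmentation output feeds every perspective prompt, and the reviewer pass is what turns multiple agent opinions into a single verdict.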