
AAAI 2026

January 25, 2026

Singapore, Singapore


Recently, multimodal large language models (MLLMs) have achieved significant advances across various domains, and the corresponding evaluation benchmarks have been continuously refined. In this process, benchmarks in the scientific domain have played an important role in assessing the reasoning capabilities of MLLMs. However, existing benchmarks still face three key challenges: 1) insufficient evaluation of models' reasoning abilities in multilingual scenarios; 2) inadequate assessment of MLLMs' comprehensive modality coverage; 3) lack of fine-grained annotation of scientific knowledge points. To address these gaps, we propose MME-SCI, a comprehensive and challenging benchmark. We carefully collected 1,019 high-quality question-answer pairs spanning 3 distinct evaluation modes. These pairs cover four subjects, namely mathematics, physics, chemistry, and biology, and support five languages: Chinese, English, French, Spanish, and Japanese. We conducted extensive experiments on 16 open-source models and 4 closed-source models, and the results demonstrate that MME-SCI is broadly challenging for existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini achieved accuracies of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics, physics, chemistry, and biology, respectively, indicating a significantly higher difficulty level than existing benchmarks. More importantly, using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed existing models' performance in depth and identified their weaknesses in specific domains. For example, on questions related to "Magnetic Field", o4-mini correctly answered only 5 out of 33 questions, exposing the model's vulnerabilities at a fine-grained level. These findings highlight the urgent need to enhance the scientific reasoning capabilities of MLLMs. Code and samples are available in the Supplementary Materials.
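The per-subject and per-knowledge-point accuracies reported above can be computed by grouping graded question records. A minimal sketch, assuming a hypothetical record schema (a grouping key plus a boolean `correct` flag; this is not MME-SCI's actual data format):

```python
from collections import defaultdict

def accuracy_by_key(records, key):
    """Return accuracy per group, where groups are defined by `key`.

    Each record is a dict containing the grouping key (e.g. "subject"
    or "knowledge_point") and a boolean "correct" flag.
    """
    totals = defaultdict(int)   # questions seen per group
    hits = defaultdict(int)     # questions answered correctly per group
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

# Illustration with the "Magnetic Field" statistic from the abstract
# (5 correct out of 33 questions):
records = [{"knowledge_point": "Magnetic Field", "correct": i < 5}
           for i in range(33)]
acc = accuracy_by_key(records, "knowledge_point")
print(f'{acc["Magnetic Field"]:.2%}')  # prints "15.15%"
```

The same function serves both reporting granularities: pass `key="subject"` for the four-subject breakdown, or `key="knowledge_point"` for the fine-grained analysis.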


