2025 AMA Research Challenge – Member Premier Access

October 22, 2025

Virtual only, United States

Background

As large language models (LLMs) gain multimodal capabilities to interpret both images and text, their potential to assist in clinical decision-making continues to grow. While prior research has shown that LLMs can match or even outperform individual physicians on standardized exams, little is known about their performance relative to collective human reasoning, in which varied perspectives contribute to more accurate conclusions. This study evaluates state-of-the-art multimodal LLMs on complex image-based medical challenges, comparing their performance with both individual and aggregate responses from a large pool of medically literate readers.

Methods

We compiled 100 ophthalmology-related cases from the New England Journal of Medicine (NEJM) Image Challenge archive (2006–2024), each containing a clinical vignette, image, and multiple-choice question. Six multimodal LLMs (GPT-4 Turbo, GPT-4o, GPT-4o mini, Gemini 1.5 Flash, Gemini 1.5 Pro, and Claude 3.5 Haiku) were prompted to answer each case and self-report confidence (1–4 scale). Human performance was assessed using (1) the average accuracy of NEJM respondents and (2) the accuracy of the most commonly selected (modal) answer per case, representing collective judgment. LLM accuracy and confidence were compared using Kruskal-Wallis and post hoc Dunn’s tests (p < 0.05 considered significant).
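For illustration, the aggregation and comparison steps above could be sketched as follows in Python. The answers, answer key, and model labels are hypothetical placeholders, not study data, and the scikit-posthocs package is assumed for Dunn's post hoc test.

    # Minimal sketch of the analysis described above; all data here are
    # illustrative placeholders. Assumes scikit-posthocs for Dunn's test.
    from collections import Counter

    import pandas as pd
    from scipy.stats import kruskal
    import scikit_posthocs as sp

    # Hypothetical answers: one row per case, one column per model.
    answers = pd.DataFrame({
        "gpt4o":      ["A", "B", "C", "A"],
        "gemini_pro": ["A", "B", "D", "A"],
        "claude":     ["B", "B", "C", "A"],
    })
    key = pd.Series(["A", "B", "C", "A"])  # hypothetical answer key

    # Collective (modal) answer per case: the option chosen most often.
    modal = answers.apply(lambda row: Counter(row).most_common(1)[0][0], axis=1)
    print("modal accuracy:", (modal == key).mean())

    # Per-model correctness indicators, compared with Kruskal-Wallis,
    # then Dunn's post hoc test with Bonferroni adjustment (p < 0.05).
    correct = {m: (answers[m] == key).astype(int).tolist() for m in answers}
    h_stat, p_value = kruskal(*correct.values())
    dunn = sp.posthoc_dunn(list(correct.values()), p_adjust="bonferroni")
    print(h_stat, p_value)
    print(dunn)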

Results

All LLMs except Claude 3.5 Haiku outperformed the average individual respondent (mean human accuracy: 50.4%, 95% CI: 40.8–60.0%), with GPT-4o achieving the highest model accuracy at 72.0% (95% CI: 66.7–76.8%). However, the collective human response (the most frequently selected answer per case) was correct in 93.0% of cases (95% CI: 86.3–96.6%), significantly outperforming every LLM. The most popular response across LLMs (i.e., the answer most frequently selected by the models) was correct in 65.3% of cases (95% CI: 56.0–74.7%), significantly lower than the collective human accuracy. Correct answers were associated with higher confidence scores for all LLMs except Gemini 1.5 Flash, suggesting self-rated confidence may be a useful proxy for model reliability.
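As a worked check on the reported intervals, a Wilson score interval on the modal-answer accuracy (93 of 100 cases correct) reproduces the quoted 86.3–96.6% range; the abstract does not name its CI method, so Wilson is an assumption here.

    # 95% CI for a proportion (e.g., 93/100 correct modal answers).
    # The study's CI method is not stated; Wilson is assumed because it
    # matches the reported 86.3-96.6% interval.
    from statsmodels.stats.proportion import proportion_confint

    n_correct, n_cases = 93, 100
    lo, hi = proportion_confint(n_correct, n_cases, alpha=0.05, method="wilson")
    print(f"accuracy {n_correct / n_cases:.1%}, 95% CI {lo:.1%}-{hi:.1%}")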

Conclusion

Multimodal LLMs can outperform individual human respondents on image-based diagnostic tasks but fall short of the accuracy achieved by collective human reasoning. This underscores a key strength of human cognition: the diversity of thought across individuals produces more accurate outcomes than any single decision-maker, human or machine. In contrast, LLMs often rely on a uniform reasoning process that can propagate the same errors consistently. While LLMs hold promise as diagnostic aids, further refinement may be required for them to match the collective intelligence of human experts in interpreting clinical vignettes.
