

Title: Accuracy Benchmark Highlights AI's Potential to Transform Medical Education and Curricula Development
Background: The impressive capabilities of modern large language models (LLMs) such as ChatGPT have recently been demonstrated in the context of the United States Medical Licensing Examination (USMLE). The National Board of Medical Examiners (NBME) released a study showing that these models can pass actual USMLE exams, and numerous studies have documented improvements with successive models. However, a comprehensive analysis of the performance of these models, including ChatGPT 4 Omni, across specific medical content domains remains absent, limiting our understanding of their full potential in medical education. Furthermore, establishing an accuracy benchmark is crucial for demonstrating the reliability of these models in providing accurate and dependable information.
Methods: Utilizing 750 clinical vignette-based multiple-choice questions (MCQs) provided by medical schools, the performance of successive iterations of the leading LLM, ChatGPT, namely ChatGPT 3.5 (GPT-3.5), ChatGPT 4 (GPT-4), and ChatGPT 4 Omni (GPT-4o), was evaluated across USMLE disciplines, clinical clerkships, and clinical skills such as diagnostics and management. Accuracy was assessed using a standardized protocol, and rigorous statistical analyses were employed to compare the reliability of these models in responding to the MCQs.
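As an illustrative sketch of such a scoring protocol, the Python snippet below tallies overall and per-discipline accuracy for one model over a set of MCQs. The question record format and the query_model callable are assumptions made for illustration, not the study's actual evaluation pipeline.

    from collections import defaultdict

    def score_mcqs(questions, query_model):
        # questions: list of dicts with keys 'stem', 'choices', 'answer', 'discipline'
        # query_model: callable taking (stem, choices) and returning a choice letter
        correct_by_discipline = defaultdict(int)
        total_by_discipline = defaultdict(int)
        overall_correct = 0
        for q in questions:
            predicted = query_model(q["stem"], q["choices"])
            total_by_discipline[q["discipline"]] += 1
            if predicted == q["answer"]:
                correct_by_discipline[q["discipline"]] += 1
                overall_correct += 1
        per_discipline = {d: correct_by_discipline[d] / total_by_discipline[d]
                          for d in total_by_discipline}
        return overall_correct / len(questions), per_discipline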
Results: GPT-4o achieved a correct response rate of 90.4%, significantly outperforming GPT-4 (81.1%), GPT-3.5 (60.0%), and the medical student average of 59.3% (95% CI: 58.3%-68.3%). In preclinical disciplines, GPT-4o performed best in Social Sciences (95.5%), Microbiology (92.3%), and Immunology (92.9%), while GPT-4 and GPT-3.5 each scored highest in Behavioral and Neuroscience (86.5% and 76.9%, respectively). In clinical clerkship categories, GPT-4o attained perfect scores in Family Medicine (100%) and Internal Medicine (100%), while GPT-4's highest score was in Internal Medicine (95.5%) and GPT-3.5's was in Neurology (69.5%).
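To illustrate how such accuracy differences could be compared statistically, the sketch below runs a chi-square test on counts derived from the reported percentages and the 750-question set; the choice of test is our assumption here, and the study's exact analyses may differ.

    from scipy.stats import chi2_contingency

    n = 750
    gpt4o_correct = round(0.904 * n)   # derived from the reported GPT-4o accuracy
    gpt4_correct = round(0.811 * n)    # derived from the reported GPT-4 accuracy
    table = [[gpt4o_correct, n - gpt4o_correct],
             [gpt4_correct, n - gpt4_correct]]
    chi2, p_value, _, _ = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.4g}")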
Conclusion: This study highlights the exceptional performance of the ChatGPT series across USMLE preclinical and clinical disciplines, showcasing the transformative potential of AI in medical education. The gains seen with ChatGPT 4 Omni underscore its utility as an educational tool and suggest a future in which AI enhances learning and competency in medical fields. The accuracy benchmark established here supports growing confidence in the reliability of AI-generated answers. As AI integration in education grows, our findings emphasize the need for a formal curriculum to guide proper usage, ensuring oversight and validation to maximize benefits while maintaining educational standards.