

Evaluating the Utility of a Large Language Model in Simulation-Based Medical Education
Background Clinical simulations play an important role in medical education, allowing students to gain realistic clinical experience in a didactic manner and improve their patient interaction and clinical reasoning skills. However, access to simulations is often limited by resources, including physical space in simulation centers and time with standardized patients. ChatGPT is a freely accessible, web-based large language model (LLM) that has garnered significant attention in the medical community for its ability to generate relatively accurate medical information. We evaluated the ability of ChatGPT to create high-fidelity virtual clinical simulations and to provide simulated patient information and feedback across several clinical scenarios.
Methods We prompted ChatGPT to generate seven clinical scenarios (fractured arm/intimate partner violence, acute appendicitis, aortic dissection, acetaminophen overdose, acute seizure/metastatic lung cancer, COPD exacerbation, and tension pneumothorax) and to create interactive simulations including simulated patient history, physical exam findings, lab findings, and imaging descriptions. Each scenario concluded with feedback and was repeated in triplicate to assess consistency. Simulation scenarios were graded for fidelity (accuracy, realism, and consistency) on a 5-point Likert scale (1-2 = poor, 2-3 = low, 3-4 = moderate, 4-5 = high) by three independent physicians holding the academic rank of Full Professor. A one-sample Wilcoxon signed-rank test was used to assess simulation performance against a threshold criterion of greater than three out of five points. Fleiss’ kappa was used to assess interrater reliability. All hypothesis testing was performed at a significance level of p < 0.05.
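The abstract does not specify the exact prompt wording or whether ChatGPT was accessed through its web interface or an API. As an illustration only, the sketch below shows how a comparable simulation prompt could be issued programmatically with the openai Python client; the model name, prompt text, and scenario framing are assumptions, not the authors' materials.

```python
# Hypothetical sketch: issuing a simulation-style prompt to ChatGPT via the openai client.
# The prompt wording and model choice are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

scenario = "acute appendicitis"  # one of the seven scenarios evaluated in the study

prompt = (
    f"Act as a simulated patient presenting with {scenario}. "
    "Provide a realistic history when asked, reveal physical exam findings, "
    "lab results, and imaging descriptions only when requested, and give "
    "feedback on my clinical reasoning when the encounter ends."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumed model; the abstract does not name a version
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In an interactive simulation, the conversation would continue turn by turn, with each learner question appended to the message history so the model can maintain the scenario.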
Results Of the seven simulations evaluated, four were rated high fidelity, two were rated borderline high fidelity, and one was rated moderate fidelity with respect to simulation accuracy, realism, and consistency. The mean overall score was 4.0, indicating high accuracy, realism, and consistency, significantly higher than the hypothesized threshold of 3.0 (p < 0.001). There was a high degree of interrater agreement (ordinal weighted Fleiss κ = 0.763; 95% CI: 0.25-1.0; p < 0.05).
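As a minimal sketch of the analysis described above, and assuming hypothetical rating data (the study's raw ratings are not reported), the following Python code applies a one-sample Wilcoxon signed-rank test against the threshold of 3 and computes Fleiss’ kappa with statsmodels; note that statsmodels provides the unweighted kappa, whereas the study reports an ordinal weighted variant.

```python
# Sketch of the reported analysis using hypothetical fidelity ratings.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows = 7 scenarios, columns = 3 physician raters (1-5 scale).
ratings = np.array([
    [4, 5, 4],
    [4, 4, 5],
    [5, 4, 4],
    [4, 4, 4],
    [4, 3, 4],
    [3, 4, 4],
    [3, 3, 4],
])

# One-sample Wilcoxon signed-rank test: are ratings greater than the threshold of 3?
stat, p_value = wilcoxon(ratings.flatten() - 3, alternative="greater")
print(f"mean rating = {ratings.mean():.2f}, Wilcoxon p = {p_value:.4f}")

# Interrater reliability: collapse raw ratings to subject-by-category counts, then kappa.
table, _ = aggregate_raters(ratings)
print(f"Fleiss' kappa (unweighted) = {fleiss_kappa(table):.3f}")
```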
Conclusion Overall, ChatGPT reliably generated accurate and consistent virtual clinical simulations. Our study highlights a framework for a novel use case of LLMs in medical education and demonstrates their utility in increasing medical students' access to simulations. Although LLMs have known limitations in information accuracy, with the rapid advancement of artificial intelligence technology and the advent of LLMs that can be trained on specific resources, the use of LLMs such as ChatGPT in simulation-based medical education represents a promising frontier.