

Assessing the Accuracy of ChatGPT to Answer Commonly Asked Patient Questions about Osteosarcoma
Introduction: As artificial intelligence (AI) continues to gain popularity among patients as an educational resource, it is crucial to understand whether AI can provide reliable information on orthopaedic conditions. While prior studies have investigated the utility of ChatGPT in providing information on common orthopaedic surgeries, they have yet to explore its utility for rarer, more complex diseases. This study evaluates the accuracy and comprehensibility of ChatGPT's responses to commonly asked patient questions about osteosarcoma.
Methods: Ten frequently asked questions (FAQs) regarding osteosarcoma were compiled through a literature review and national society patient FAQ pages. ChatGPT (Version 3.5) was then used to answer these questions. For each response, a detailed description was written based on relevant literature supporting or refuting the chatbot's claims. Responses were analyzed for accuracy and clarity using a previously validated scoring system for ChatGPT response accuracy and a modified DISCERN score. Each response was independently reviewed by three authors, and their scores were averaged as a crowd-sourced scoring strategy. Readability was assessed using five published educational-level indices. FAQ compilation and scoring were completed in collaboration with two fellowship-trained orthopaedic oncology surgeons.
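The abstract does not specify which five educational-level indices were used or what tooling computed them. As a rough illustration only, the sketch below scores a sample response with five common grade-level formulas via the Python textstat package; the index choices, the package, and the sample text are all assumptions, not the authors' pipeline.

```python
# Hypothetical readability check; the index choices and the textstat package
# are assumptions, since the abstract does not name its five indices or tooling.
import textstat

# Placeholder text standing in for one ChatGPT response (not study data).
response_text = (
    "Osteosarcoma is a type of bone cancer that most often develops in the "
    "long bones of adolescents and young adults, and treatment usually "
    "combines chemotherapy with surgical resection."
)

indices = {
    "Flesch-Kincaid Grade": textstat.flesch_kincaid_grade(response_text),
    "Gunning Fog": textstat.gunning_fog(response_text),
    "SMOG": textstat.smog_index(response_text),
    "Coleman-Liau": textstat.coleman_liau_index(response_text),
    "Automated Readability Index": textstat.automated_readability_index(response_text),
}

for name, grade in indices.items():
    print(f"{name}: {grade:.1f}")  # approximate U.S. grade level per formula
```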
Results: ChatGPT's responses generally required moderate clarification, with a mean accuracy score of 3 (satisfactory but requiring moderate clarification). One response received a mean rating of 2 (satisfactory, requiring minimal clarification), five responses received a rating of 2.5, and four responses received a rating of 3. The 10 responses received a mean DISCERN score of 36 (classified as poor, range 28-38). The interrater reliability between the two orthopaedic oncology surgeons for the DISCERN criteria was 0.601, qualifying as moderate agreement. Readability ranged from a 7th-grade to a college-graduate level, exceeding the recommended reading level for patient educational materials.
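The abstract does not state which statistic produced the 0.601 interrater reliability. One common choice for ordinal DISCERN item ratings is a linearly weighted Cohen's kappa; the sketch below, with made-up ratings, is offered only as an assumption of how such a value might be computed.

```python
# Hypothetical interrater-agreement sketch; the statistic (weighted Cohen's
# kappa) and the ratings below are assumptions, not the study's data or method.
from sklearn.metrics import cohen_kappa_score

# Illustrative DISCERN item ratings (1-5) from two raters across items.
rater_a = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 3, 4, 3, 2, 3]
rater_b = [3, 3, 4, 2, 2, 3, 3, 2, 3, 4, 2, 3, 3, 3, 2, 3]

kappa = cohen_kappa_score(rater_a, rater_b, weights="linear")
print(f"Linearly weighted kappa: {kappa:.3f}")
```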
Conclusions: While moderately accurate, most responses regarding osteosarcoma required further clarification and were written at an inaccessible reading level. ChatGPT can therefore serve as a starting point that supplements traditional patient education strategies for osteosarcoma, but it should not replace professional medical advice. One major limitation of AI apparent in the present study is its inability to provide personalized recommendations: the chatbot does not account for the varying clinical presentations and degrees of scientific literacy that might affect an individual patient's counseling and treatment options. Future research should apply similar methodologies to other AI platforms to more comprehensively investigate the breadth and accuracy of online resources available to patients.