
VIDEO DOI: https://doi.org/10.48448/31wg-v985

poster

AMA Research Challenge 2024

November 07, 2024

Virtual only, United States

Development of a Benchmarking Dataset for Symptom Detection Using Large Language Models

Background
Longitudinal symptom monitoring is challenging, which hinders supportive care delivery, quality improvement, and research efforts. Over half of oncology patients' emergency department visits are preventable, and many are caused by symptom exacerbations. Patient conversations contain valuable information about symptoms, but this information is not well captured in medical documentation. The combination of large language models (LLMs) and ambient audio recording technology could enable robust monitoring of patient symptoms from clinical conversations, surpassing traditional NLP techniques. While various benchmarks exist for assessing LLMs on medical question answering, there is a lack of standardized benchmarks for evaluating their performance in real-world symptom detection use cases. This study aims to test the performance of four LLMs at capturing symptoms discussed in transcribed conversations from standardized clinical encounters.

Methods
This study demonstrates a novel methodology for evaluating LLMs on their ability to extract symptoms from published simulated patient conversations. We created a gold-standard, double-coded transcript dataset (n=264) annotated for 16 specific symptoms plus a catch-all 'other' category. Four models (GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, and GPT-4o) were assessed on the quality of their zero-shot symptom extractions. Models were evaluated for task performance (precision, recall, accuracy, and F1 score). Statistical significance was calculated using Cochran's Q, followed by pairwise post hoc McNemar tests with a Bonferroni correction.

Results
A total of 3,085 transcript excerpts were annotated for the presence or absence of symptoms. Of these, 2,087 excerpts contained symptoms and were further annotated for the 16 symptoms of interest. The three most common symptoms were pain, cough, and shortness of breath. GPT-4 showed the highest overall performance for symptom talk detection (F1 = 0.94, p < 0.05). Detection of the three most common symptoms achieved F1 scores ranging from 0.74 to 0.89 across models.

Conclusion
We developed a patient transcript benchmarking dataset for symptoms and demonstrated LLM evaluation using this dataset. This study highlights the effectiveness of GPT-4 in symptom detection, which showed the best overall performance among the four evaluated models (p < 0.05). The high F1 scores achieved for detecting the three most common symptoms (pain, cough, and shortness of breath) across models support the potential of LLMs for clinical NLP tasks. Because LLM-based data extraction is institution-agnostic, these models may provide a more generalizable and transferable approach to challenging NLP tasks in healthcare. Future research includes benchmarking on real patient data and practical considerations for implementing LLM-augmented symptom tracking into current workflows.
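The evaluation scheme described in the Methods can be sketched in a few lines of Python. The following is a minimal, hypothetical sketch (not the authors' code), assuming `y_true` holds the gold-standard binary labels per transcript excerpt and `preds` maps each model name to its binary zero-shot predictions; it computes the reported task metrics and applies Cochran's Q followed by pairwise McNemar tests with a Bonferroni correction, using scikit-learn and statsmodels.

```python
# Hypothetical evaluation sketch; variable names and structure are assumptions,
# not the study's actual implementation.
from itertools import combinations

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar


def evaluate_models(y_true, preds, alpha=0.05):
    y_true = np.asarray(y_true)
    names = list(preds.keys())

    # Per-model task performance: precision, recall, accuracy, F1.
    for name in names:
        y_pred = np.asarray(preds[name])
        print(f"{name}: P={precision_score(y_true, y_pred):.2f} "
              f"R={recall_score(y_true, y_pred):.2f} "
              f"Acc={accuracy_score(y_true, y_pred):.2f} "
              f"F1={f1_score(y_true, y_pred):.2f}")

    # Cochran's Q over per-excerpt correctness (1 = model matched the gold label).
    correct = np.column_stack(
        [np.asarray(preds[name]) == y_true for name in names]
    ).astype(int)
    q = cochrans_q(correct)
    print(f"Cochran's Q: statistic={q.statistic:.2f}, p={q.pvalue:.4f}")

    # Pairwise post hoc McNemar tests with a Bonferroni-corrected alpha.
    pairs = list(combinations(names, 2))
    corrected_alpha = alpha / len(pairs)
    for a, b in pairs:
        ca, cb = correct[:, names.index(a)], correct[:, names.index(b)]
        # 2x2 table of agreement/disagreement in correctness between the two models.
        table = [[np.sum((ca == 1) & (cb == 1)), np.sum((ca == 1) & (cb == 0))],
                 [np.sum((ca == 0) & (cb == 1)), np.sum((ca == 0) & (cb == 0))]]
        p = mcnemar(table, exact=True).pvalue
        verdict = "significant" if p < corrected_alpha else "n.s."
        print(f"{a} vs {b}: p={p:.4f} ({verdict} at Bonferroni-corrected alpha)")
```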
