Poster
Development of a Benchmarking Dataset for Symptom Detection Using Large Language Models
Background
Longitudinal symptom monitoring is challenging, which hinders supportive care delivery, quality improvement, and research efforts. Over half of oncology patients' emergency department visits are preventable, and many are caused by symptom exacerbations. Patient conversations contain valuable information about symptoms, but this information is not well captured in medical documentation. The combination of Large Language Models (LLMs) and ambient audio recording technology could enable robust monitoring of patient symptoms from clinical conversations, surpassing traditional NLP techniques. While various benchmarks assess LLMs on medical question answering, there is a lack of standardized benchmarks for evaluating their performance on real-world symptom detection use cases. This study evaluates the performance of four LLMs in capturing symptoms discussed in transcribed conversations from standardized clinical encounters.

Methods
This study demonstrates a novel methodology for evaluating LLMs on their ability to extract symptoms from published simulated patient conversations. We created a gold-standard, double-coded transcript dataset (n=264) annotated for 16 specific symptoms plus a catch-all 'other' category. Four models (GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, and GPT-4o) were assessed on the quality of their zero-shot symptom extractions. Models were evaluated for task performance (precision, recall, accuracy, and F1 score). Statistical significance was assessed using Cochran's Q, followed by pairwise post hoc McNemar tests with a Bonferroni correction.

Results
A total of 3,085 transcript excerpts were annotated for the presence or absence of symptoms. Of these, 2,087 excerpts contained symptoms and were further annotated for the 16 symptoms of interest. The three most common symptoms were pain, cough, and shortness of breath. GPT-4 showed the highest overall performance for symptom talk detection (F1 = 0.94, p < 0.05). Detection of the three most common symptoms achieved F1 scores ranging from 0.74 to 0.89 across models.

Conclusion
We developed a patient transcript benchmarking dataset for symptoms and demonstrated LLM evaluation using this dataset. This study highlights the effectiveness of GPT-4 in symptom detection, showing the best overall performance among the four evaluated models (p < 0.05). The high F1 scores achieved for detecting the three most common symptoms (pain, cough, and shortness of breath) across models support the potential of LLMs for clinical NLP tasks. Because LLM-based data extraction is institution-agnostic, these models may provide a more generalizable and transferable approach to challenging NLP tasks in healthcare. Future work includes benchmarking on real patient data and practical considerations for implementing LLM-augmented symptom tracking into current workflows.
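
The Methods describe zero-shot symptom extraction from transcript excerpts with four GPT models. The sketch below illustrates what such a zero-shot call could look like using the OpenAI Python SDK; the prompt wording, symptom list, temperature setting, and model identifier are illustrative assumptions, not the study's actual protocol.

```python
# Minimal sketch of zero-shot symptom extraction from a transcript excerpt.
# The prompt, symptom list, and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYMPTOMS = [
    "pain", "cough", "shortness of breath", "fatigue", "nausea",
    # ... remaining symptoms of interest, plus a catch-all "other"
]

def extract_symptoms(excerpt: str, model: str = "gpt-4") -> str:
    """Ask the model which of the listed symptoms are discussed in the excerpt."""
    prompt = (
        "You will be given an excerpt from a clinical conversation.\n"
        f"Report which of these symptoms are discussed: {', '.join(SYMPTOMS)}, or 'other'.\n"
        "If no symptoms are discussed, reply 'none'.\n\n"
        f"Excerpt:\n{excerpt}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation runs
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage on a single excerpt:
# print(extract_symptoms("Patient: The cough keeps me up at night and my chest hurts."))
```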
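
The evaluation pairs standard classification metrics with Cochran's Q and Bonferroni-corrected pairwise McNemar tests. The following is a minimal sketch of how that analysis could be run with scikit-learn and statsmodels; the labels, predictions, and model keys are hypothetical toy data, not study results.

```python
# Sketch of the reported evaluation: per-model precision/recall/accuracy/F1,
# Cochran's Q across the four models, and Bonferroni-corrected pairwise
# McNemar tests. All data below are hypothetical.
import numpy as np
from itertools import combinations
from sklearn.metrics import precision_recall_fscore_support, accuracy_score
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

# y_true: gold-standard labels (1 = symptom talk present) for each excerpt.
# preds[name]: binary predictions from each model on the same excerpts.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
preds = {
    "gpt-3.5-turbo": np.array([1, 0, 0, 1, 0, 1, 1, 1]),
    "gpt-4":         np.array([1, 0, 1, 1, 0, 1, 0, 1]),
    "gpt-4-turbo":   np.array([1, 0, 1, 1, 1, 1, 0, 1]),
    "gpt-4o":        np.array([1, 1, 1, 1, 0, 1, 0, 1]),
}

# Task performance per model.
for name, y_pred in preds.items():
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    acc = accuracy_score(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f} accuracy={acc:.2f} F1={f1:.2f}")

# Cochran's Q: do correctness rates differ across the four models?
correct = np.column_stack([preds[m] == y_true for m in preds]).astype(int)
print("Cochran's Q:", cochrans_q(correct))

# Pairwise post hoc McNemar tests with a Bonferroni correction.
pairs = list(combinations(preds.keys(), 2))
alpha = 0.05 / len(pairs)
for a, b in pairs:
    ca, cb = preds[a] == y_true, preds[b] == y_true
    table = [[np.sum(ca & cb), np.sum(ca & ~cb)],
             [np.sum(~ca & cb), np.sum(~ca & ~cb)]]
    result = mcnemar(table, exact=True)
    print(f"{a} vs {b}: p={result.pvalue:.4f} (significant if < {alpha:.4f})")
```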