
poster
Evaluation of GPT-4 Concordance with North American Spine Society Guidelines for Lumbar Fusion Surgery
Summary of Background Data Concordance with evidence-based medicine (EBM) guidelines is associated with improved clinical outcomes in spine surgery. The North American Spine Society (NASS) has published coverage guidelines on indications for lumbar fusion surgery, and a recent survey demonstrated a 60% concordance rate among its members. GPT-4 is a recent, widely used deep learning model trained on public data sources, including those containing EBM guidelines. Little is known about the utility of artificial intelligence in clinical decision-making for patients with lumbar pathology.
Purpose To assess GPT-4’s responses to validated clinical vignettes and evaluate them for concordance with NASS clinical guidelines and recommendations for lumbar fusion surgery.
Study Design Comparative Analysis and Narrative Review
Patient Sample Adult patients in the United States evaluated for lumbar spine fusion surgery.
Methods Seventeen well-validated clinical vignettes with specific indications for or against lumbar fusion based on NASS criteria were obtained from a previously published research study. Each case included the history of present illness, physical exam findings, previously attempted treatments, and definitive imaging interpretations. The cases were transcribed into a standardized prompt and entered into GPT-4 to obtain a decision on whether fusion was indicated. The query for each case was repeated three times under identical conditions to evaluate the model’s inter-query reliability. If the three queries for a given case disagreed, GPT-4’s decision for that case was determined by majority rule. All queries were entered in separate strings to ensure that no contextual memory in the software influenced subsequent responses. The investigator entering the prompts was blinded to the NASS-concordant decisions for the cases until data collection was complete. GPT-4 inter-query reliability was assessed using the Fleiss’ kappa statistic. Differences in decision-making between GPT-4 and NASS guidelines were analyzed with chi-square analysis.
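The majority-rule step and the inter-query reliability statistic described above can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `majority_decision` and `fleiss_kappa`, the label strings, and the four example cases are all hypothetical and chosen only for illustration. The sketch computes the kappa point estimate using the standard Fleiss formula and does not compute a p-value.

```python
from collections import Counter


def majority_decision(responses):
    """Return the most frequent label among the repeated responses for one case."""
    label, _count = Counter(responses).most_common(1)[0]
    return label


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of cases, each rated the same number of times.

    ratings: list of lists, one inner list of labels per case.
    categories: list of all possible labels.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    if any(len(case) != n_raters for case in ratings):
        raise ValueError("every case needs the same number of ratings")

    # counts[i][j] = number of ratings assigning case i to category j
    counts = [[Counter(case)[cat] for cat in categories] for case in ratings]

    # Observed agreement within each case, then averaged over cases
    per_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_case) / n_cases

    # Agreement expected by chance from the overall category proportions
    proportions = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in proportions)

    if p_expected == 1.0:
        # Every rating fell in one category; kappa is undefined
        return float("nan")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical responses (NOT study data): four cases, three queries each
hypothetical_responses = [
    ["fusion", "fusion", "fusion"],
    ["no fusion", "no fusion", "no fusion"],
    ["fusion", "fusion", "no fusion"],
    ["no fusion", "no fusion", "no fusion"],
]

decisions = [majority_decision(case) for case in hypothetical_responses]
kappa = fleiss_kappa(hypothetical_responses, ["fusion", "no fusion"])
print(decisions)
print(round(kappa, 3))
```

With three queries and a binary decision, a tie is impossible, so majority rule always yields a single decision per case. Here the three repeated queries play the role of the raters in Fleiss' formulation.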
Results GPT-4 responses for 14/17 (82.4%) of the clinical vignettes were in concordance with NASS EBM lumbar fusion guidelines. GPT-4’s decisions on whether spine fusion surgery was indicated were significantly associated with NASS guidelines (χ² = 7.14; p < 0.01). There was substantial agreement among the sets of responses generated by GPT-4 for each clinical case (κ = 0.71; p < 0.001).
Conclusions There is significant concordance between GPT-4 responses and NASS EBM indications for lumbar fusion surgery. Artificial intelligence and deep learning models may prove to be an effective adjunct tool for clinical decision-making and medical education within modern spine surgery practices.