Large Language Models as Decision Support Tools in Pediatric Emergency Medicine: Diagnostic Accuracy and Clinical Reasoning

Document Type

Conference Proceeding - Restricted Access

Publication Date

5-8-2026

Abstract

Recent advances in artificial intelligence (AI) have enabled generative models to produce accurate, detailed text-based responses to written prompts. The purpose of this prospective study was to compare the accuracy of seven large language models (LLMs) in a series of diagnostically complex pediatric cases in emergency medicine (EM).

This prospective cross-sectional study evaluated the diagnostic performance of seven widely used LLMs in October 2025. Each LLM was assessed on 25 complex or rare pediatric emergency medicine case scenarios selected by a panel of academic faculty. Patient case descriptions were provided to each model, which was then prompted to generate a primary diagnosis and a differential diagnosis. Secondary outcomes included pathophysiology, diagnostic testing, treatment plans, complications, and prognostic factors. Comparative analyses across LLMs were conducted using chi-square and ANOVA tests for key categorical and continuous variables, respectively.
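The chi-square comparison of categorical accuracy across models can be sketched as a test of independence on a 7 x 2 table (correct vs. incorrect diagnoses per model). The counts below are hypothetical illustrations consistent with 25 cases per model, not the study's actual data, and the pure-Python computation is a minimal sketch rather than the authors' analysis code.

```python
# Minimal sketch of a chi-square test of independence for diagnostic
# accuracy across 7 models, each evaluated on 25 cases.
# NOTE: these counts are hypothetical, not the study's actual data.
correct = [24, 23, 22, 20, 19, 19, 18]  # diagnoses correct per model
n_cases = 25                            # cases per model

wrong = [n_cases - c for c in correct]
total_correct = sum(correct)
total = n_cases * len(correct)

# Expected counts under the null hypothesis that accuracy is the
# same for every model (row totals are equal here: n_cases each).
exp_correct = n_cases * total_correct / total
exp_wrong = n_cases - exp_correct

# Chi-square statistic: sum of (observed - expected)^2 / expected
# over all 14 cells of the 7 x 2 contingency table.
chi2 = sum((o - exp_correct) ** 2 / exp_correct for o in correct)
chi2 += sum((o - exp_wrong) ** 2 / exp_wrong for o in wrong)

df = (len(correct) - 1) * (2 - 1)  # (rows - 1) * (cols - 1) = 6
print(f"chi2 = {chi2:.2f}, df = {df}")
```

In practice one would obtain the p-value from the chi-square distribution with 6 degrees of freedom (e.g. via `scipy.stats.chi2_contingency`); the statistic would then be compared against the critical value for the chosen significance level.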

The LLM with the highest diagnostic accuracy was Claude.ai (96%), followed by MS Copilot (92%), Perplexity (88%), and Gemini (80%). The remaining three LLMs had accuracies ranging from 72% to 76%. Accuracy differences were statistically significant across models (p < 0.05). All models showed promise in recommending diagnostic testing (84-88%), initial treatment plans (88-96%), anticipating complications (84-96%), and prognosis (72-92%). OpenEvidence and DeepSeek exceeded other models in targeted clinical question-answering and scenario-based reasoning tasks. Microsoft Copilot was considered particularly valuable for students because it provided helpful suggestions and structured responses that facilitated learning. ChatGPT had notable limitations, including difficulty recognizing nuanced or atypical patient symptoms and errors in interpreting laboratory and imaging data.

Current LLMs show promise as decision-support and educational tools in pediatric emergency medicine, with several models approaching expert-level performance on complex cases. However, model-specific limitations and inconsistent reasoning underscore the need for cautious, supervised deployment, standardized benchmarking, and ongoing research to optimize their training, validation, and use in real-world pediatric care.

Comments

2026 Research Day Corewell Health West, Grand Rapids, MI, May 8, 2026. Abstract 1884

