Comparative Accuracy of Large Language Models in Diagnosing Complex Toxic Exposures and Guiding Management
Document Type
Conference Proceeding - Restricted Access
Publication Date
5-8-2026
Abstract
Recent advances in artificial intelligence (AI) have produced large language models (LLMs) capable of generating clinically relevant, context-aware text from free-text prompts. Their diagnostic performance in toxicology remains largely untested. This study prospectively compared the diagnostic accuracy of seven LLMs across diagnostically challenging toxicology scenarios.
In this prospective cross-sectional study, twenty-five complex or rare toxicology vignettes, selected by an academic faculty panel, were presented to each model as standardized patient descriptions. Each LLM generated a primary diagnosis and a differential diagnosis; diagnostic accuracy was the primary outcome. Secondary outcomes included the quality of pathophysiology explanations, recommendations for diagnostic testing, proposed treatment plans, identification of potential complications, and discussion of prognostic factors. Comparative performance across models on key categorical and continuous variables was analyzed using chi-square tests.
The highest diagnostic accuracy (96%) was observed with Perplexity, Gemini, and OpenEvidence, while the remaining four LLMs achieved accuracies between 88% and 92% (p=0.88). All models showed strong performance in recommending diagnostic testing (88-100%), proposing initial treatment plans (84-100%), anticipating complications (96-100%), and estimating prognosis (80-92%). Perplexity and OpenEvidence performed best on targeted clinical question-answering and scenario-based reasoning, while MS Copilot particularly aided learners through structured, stepwise suggestions. DeepSeek and Gemini lacked embedded evidence-linked reference layers and were more prone to unsupported recommendations; GPT 4.0 occasionally generated specific drug doses or antidote details that required careful verification.
Across diagnostically challenging toxicology vignettes, all evaluated LLMs demonstrated high diagnostic accuracy and generally strong performance in supporting key aspects of toxicology reasoning. These findings suggest that current LLMs may serve as valuable adjuncts for education and clinical decision support in toxicology, but variability in evidence linkage and occasional unsupported or overly specific therapeutic recommendations highlight the need for careful human oversight.
Recommended Citation
Brown J, Schwab E, Hoenke T, Lindemann E, O'Brien C, Sadyora J, Lewis B, Padley M, Jones J. Comparative accuracy of large language models in diagnosing complex toxic exposures and guiding management. Presented at: Research Day Corewell Health West; 2026 May 8; Grand Rapids, MI.
Comments
2026 Research Day Corewell Health West, Grand Rapids, MI, May 8, 2026. Abstract 1886