JAMA Netw Open

LLMs struggle with clinical reasoning, despite strong final answers

April 14, 2026

Clinical takeaway: Large language models are not reliable for unsupervised clinical decision-making. Weak performance in differential diagnosis and uncertainty handling limits safe use beyond narrow, supervised tasks.

AI tools are increasingly marketed for clinical use, often highlighting high accuracy. This study tested whether those claims hold up across the full clinical workflow.

In an evaluation of 21 large language models (LLMs) using standardized clinical cases, performance varied by task. Models performed relatively well on final diagnosis and management but consistently struggled with differential diagnosis, where failure rates exceeded 80% across all models.

This gap is clinically important. Generating a differential diagnosis requires managing uncertainty and iteratively refining possibilities, which are core elements of clinical reasoning. Instead, models tended to collapse prematurely to a single answer, bypassing the diagnostic process clinicians rely on.

Overall accuracy appeared high (roughly 80%–90%) but masked these weaknesses. A more comprehensive scoring method showed wider variation and exposed gaps in reasoning that standard benchmarks miss. The analysis covered January through December 2025.

Newer “reasoning-optimized” models performed better than earlier versions but did not resolve these core limitations. Improvements were incremental, not transformative.

The findings highlight a mismatch between how models are evaluated and how clinicians think. High performance on isolated tasks does not translate to reliable decision-making across a patient encounter.

“Off-the-shelf LLMs have not yet achieved the intelligence required for safe deployment and remain limited in demonstrating advanced clinical reasoning,” the authors conclude.

Source: Rao AS. JAMA Netw Open. 2026 Apr 13. Large language model performance and clinical reasoning tasks
