Science
Large language model tops physicians on complex diagnostic reasoning tasks

Clinical takeaway: Advanced LLMs may now provide high-quality diagnostic and management second opinions—especially early in care when information is limited—but should be viewed as adjuncts pending prospective validation.
Diagnostic error remains a major source of patient harm; tools that improve early clinical reasoning could meaningfully impact outcomes, efficiency, and access to care.
A new Science study finds that an advanced large language model (OpenAI o1) consistently outperformed physicians across a range of challenging diagnostic reasoning tasks, including real-world emergency department cases, raising the possibility that AI could meaningfully augment clinical decision-making.
Across 143 NEJM clinicopathologic conference cases, the model included the correct diagnosis in its differential 78.3% of the time and listed it first in 52% of cases; accuracy rose to 97.9% when near-miss diagnoses were included.
On diagnostic test selection, clinicians judged the model’s recommendation as “exactly right” in 87.5% of cases, with only 1.5% considered unhelpful.
In structured clinical reasoning assessments, the model achieved near-perfect performance: it earned a perfect score in 78 of 80 cases using a validated reasoning rubric, significantly outperforming attending and resident physicians as well as earlier AI systems.
The strongest gains were seen in management reasoning and real-world emergency care. In management vignettes, the model scored a median of 89%, exceeding both physicians using conventional resources (34%) and GPT-4–assisted physicians (41%).
In 76 emergency department cases, blinded physician reviewers rated the model’s differential diagnoses as superior to or on par with attending physicians at all stages—particularly at initial triage, where correct or near-correct diagnoses were identified in 67.1% of cases vs 50.0%–55.3% for physicians.
“LLMs have eclipsed most benchmarks of clinical reasoning,” the authors wrote, emphasizing the performance gap was greatest “where there is the least information available,” such as early triage.
The authors caution that results are limited to text-based reasoning and call for prospective clinical trials to evaluate safety, integration, and real-world impact.
Source: Brodeur PG, et al. Performance of a large language model on the reasoning tasks of a physician. Science. April 30, 2026.