NIH

Patient narratives challenge AI’s diagnostic capabilities

August 15, 2024

NIH researchers found that while large language models (LLMs) like ChatGPT can accurately diagnose rare genetic diseases from textbook descriptions, they struggle with patient-written summaries. In tests of 10 LLMs, the best model was only 21% accurate with real patient descriptions, and some models scored as low as 1%. Accuracy improved, however, when the LLMs were presented with standardized questions about the same conditions. These findings highlight the limitations of LLMs and the need for human oversight in AI healthcare applications.
