BMJ Open
Confident answers, shaky facts: popular AI chatbots put to the test

Clinical takeaway: Patients increasingly turn to AI chatbots for health information—but this study shows these tools frequently provide misleading or incomplete guidance. Clinicians should anticipate chatbot‑driven misinformation, particularly in high‑risk areas such as nutrition, stem cell therapies, and athletic performance, and reinforce evidence‑based recommendations while clarifying the limits of AI‑generated advice.
In this large cross‑sectional audit, researchers evaluated five widely used, publicly available AI chatbots—ChatGPT, Gemini, DeepSeek, Meta AI, and Grok—across 250 responses spanning five misinformation‑prone domains: cancer, vaccines, stem cells, nutrition, and athletic performance.
Overall, 49.6% of responses were deemed problematic, including 30% rated “somewhat problematic” and 19.6% rated “highly problematic,” based on independent, safety‑focused evaluations by domain experts using predefined criteria. Performance did not differ significantly among the chatbots overall, although Grok generated a higher‑than‑expected number of highly problematic responses. The authors note that Grok is unique in being partially trained on content from X (formerly Twitter), a platform known for the rapid spread of health misinformation. Accuracy varied by topic: chatbots performed best on vaccine and cancer questions, yet even in these categories 22% and 26% of responses, respectively, were rated problematic.
Despite frequent errors, chatbots consistently responded with high confidence, rarely refusing to answer even when questions implied unsafe or unproven treatments. Only 2 of 250 prompts (0.8%) resulted in refusals, underscoring how infrequently these tools default to caution.
Citation quality was also poor. Although the chatbots commonly supplied references, many citations were incomplete or fabricated; median citation completeness was just 40%, and no chatbot produced a fully accurate reference list. The authors emphasize that these systems generate responses by predicting likely word sequences rather than by evaluating scientific evidence, which may explain why inaccurate information is often delivered in an authoritative tone. Readability was another weakness: all models averaged college‑level reading complexity, above recommended levels for public health information.
“Continued deployment of these tools without public education and oversight risks amplifying medical misinformation,” the authors conclude.
These findings highlight the need for clinicians to caution patients about relying on AI chatbot outputs for medical decisions and to clearly communicate the limitations of these tools when health questions arise.
Source: Tiller NB, et al. Generative artificial intelligence–driven chatbots and medical misinformation: an accuracy, referencing and readability audit. BMJ Open. 2026 Apr 14.