Key Summary
- Five popular platforms - ChatGPT, Gemini, Meta AI, Grok and DeepSeek - were assessed in the study.
- They were asked questions across five categories - cancer, vaccines, stem cells, nutrition, and athletic performance.
- Reference quality was noted to be poor, with an average completeness score of 40 per cent.
An analysis of five popular chatbots' responses to health and medicine questions shows that many of the responses were inaccurate or incomplete.
This highlights the health risks faced by users who are increasingly relying on these platforms.
The findings, published in the journal BMJ Open, show that nearly half of the responses were problematic as they presented a false equivalence between science-based and non-science-based claims.
Researchers from the UK, US, and Canada evaluated five popular platforms - ChatGPT, Gemini, Meta AI, Grok and DeepSeek - by asking each of them 10 open-ended and closed questions across five health categories - cancer, vaccines, stem cells, nutrition, and athletic performance.
"The audited chatbots performed poorly when answering questions in misinformation-prone health and medical fields," the authors wrote.
"Nearly half (49.6 percent) of responses were problematic: 30 percent somewhat problematic and 19.6 percent highly problematic," they said.
Chatbot performance was strongest on questions about cancer and vaccines, and weakest on stem cells, athletic performance and nutrition.
Responses were consistently presented with confidence and certainty, with few caveats or disclaimers, the study found.
Reference quality was noted to be poor, with an average completeness score of 40 per cent.
Chatbot hallucinations - creating false information and presenting it as fact - and fabricated citations meant that no chatbot provided a fully accurate reference list, the researchers said.
"Our findings regarding scientific accuracy, reference quality, and response readability highlight important behavioural limitations and the need to re-evaluate how AI chatbots are deployed in public-facing health and medical communication," the authors said.
"By default, chatbots do not access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences. They do not reason or weigh evidence, nor are they able to make ethical or value-based judgments," they said.
The researchers designed prompts to resemble common 'information-seeking' health and medical queries, language used in online misinformation, and language used in academic discourse.
The prompts were also used to stress-test the AI models and surface behavioural vulnerabilities by 'straining' them towards misinformation or contraindicated advice.
The information in the responses was scored for accuracy and completeness, with particular attention to whether a chatbot presented a false balance between science-based and non-science-based claims, regardless of the strength of the evidence.