Research published in the journal Royal Society Open Science and reported by Live Science found that newer versions of large language models (LLMs) are not only likely to oversimplify complex information but can also distort important scientific findings. Their attempts at brevity can be drastic enough to misinform health professionals, policymakers, and the general public.
From summarizing to misleading
The study, led by postdoctoral researcher Uwe Peters at the University of Bonn, evaluated over 4,900 summaries produced by 10 of the most popular LLMs, including four versions of ChatGPT, three of Claude, two of Llama, and one of DeepSeek. These were compared with human-written summaries of the same academic research.
The results were stark. Summaries generated by the chatbots were almost five times more likely than human-written summaries to overgeneralize the findings. And when prompted to prioritize accuracy over simplicity, the chatbots did not improve; they got worse. In fact, when specifically asked to be accurate, they were twice as likely to produce a misleading summary.
“Generalization can seem benign, or even helpful, until you realize it has changed the meaning of the original research,” Peters explained in an email to Live Science. What is even more concerning is that the problem appears to be growing: the newer the model, the greater the risk that it confidently delivers information that is subtly wrong.
When cautious research becomes a medical recommendation
In one striking example from the study, DeepSeek turned the careful phrase “it was safe and successfully implemented” into a bold, unqualified medical recommendation: “a safe and effective treatment option.” Another summary, by Llama, stripped out important qualifiers around the dose and frequency of a diabetes medication, which could lead to dangerous misconceptions if used in real-world medical settings. “Biases can also take more subtle forms, like the quiet inflation of a claim’s scope,” said Max Rollwage, vice president of AI and research at Limbic, a clinical mental health AI company. He added that AI-generated summaries are already integrated into healthcare workflows, making accuracy all the more important.
Why do LLMs get this so wrong?
Part of the problem stems from how LLMs are trained. Patricia Thaine, co-founder and CEO of Private AI, points out that many models learn from simplified science journalism rather than from peer-reviewed academic papers. As a result, they inherit and replicate those oversimplifications, especially when asked to summarize content that is already simplified.
More importantly, these models are often deployed in specialized domains such as medicine and science without expert oversight. “That is a fundamental misuse of the technology,” Thaine told Live Science, emphasizing that task-specific training and oversight are essential to prevent real-world harm.
The bigger issue for AI and science
Peters compares the problem to a photocopier: each copy of a copy loses a little more detail. LLMs process information through complex computational layers, often trimming away the subtle limitations and context that are essential in scientific literature.
Earlier versions of these models were more likely to decline to answer difficult questions. Ironically, the newer models are more capable and more “directable”, and also more confidently wrong.
“As their use continues to grow, this poses a real risk of large-scale misinterpretation of science at a moment when public trust and scientific literacy are already under pressure,” Peters warned.
Guardrails are not optional
Though the study’s authors acknowledge some limitations, including the need to extend testing to non-English texts and to other types of scientific claims, they argue that the findings should be a wake-up call. Developers should build workflow safeguards that flag oversimplifications and ensure that inaccurate summaries are reviewed rather than mistaken for expert-approved conclusions.
Ultimately, the takeaway is clear: as impressive as AI chatbots may appear, their summaries can easily mislead, and there is little room for error when it comes to science and medicine.
Because in the world of AI-generated science, a few extra words, or a few missing ones, can mean the difference between informed progress and dangerous misinformation.