Image: Three groups of icons representing people, with irregular geometric shapes travelling between them. The image illustrates a Large Language Model (LLM) in use: data is gathered from groups of people communicating with each other, and a machine learning algorithm selects segments of data from each group and aggregates them into a neat output. It is not obvious to human observers why the algorithm selected specific segments, as represented by the irregular shapes.

Image source: Yasmine Boudiaf & LOTI / Data Processing as part of a Better Images of AI workshop at Science Gallery London (CC-BY 4.0)

News • LLM exaggerations and overgeneralizations

Generative AI routinely blows up science findings

When summarizing scientific studies, large language models (LLMs) like ChatGPT and DeepSeek produce inaccurate conclusions in up to 73% of cases.

This is according to a new study published in the journal Royal Society Open Science by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University, Canada/University of Cambridge, UK). The researchers tested the most prominent LLMs and analyzed thousands of chatbot-generated science summaries, revealing that most models consistently produced broader conclusions than those in the summarized texts. Surprisingly, prompting for accuracy made the problem worse, and newer LLMs performed worse than older ones.

The study evaluated how accurately ten leading LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA, summarize abstracts and full-length articles from top science and medical journals (e.g., Nature, Science, and The Lancet). Testing the LLMs over one year, the researchers collected 4,900 LLM-generated summaries. Six of the ten models systematically exaggerated claims found in the original texts, often in subtle but impactful ways: for instance, changing cautious, past-tense claims like “The treatment was effective in this study” to a more sweeping, present-tense version like “The treatment is effective.” These changes can mislead readers into believing that findings apply much more broadly than they actually do.
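
To make the kind of shift described above concrete, here is a toy sketch in Python. It is not the researchers' measurement method, only an illustration built around the article's own example sentences: it flags a summary that drops the study-limiting qualifier and restates a cautious past-tense claim as a sweeping present-tense one. The regex pattern and qualifier list are illustrative assumptions.

import re

# Toy heuristic (illustrative only): a present-tense generic claim...
GENERIC_PRESENT = re.compile(r"\b(is|are)\s+effective\b", re.IGNORECASE)
# ...that no longer carries a study-limiting qualifier.
STUDY_QUALIFIERS = ("in this study", "in this trial", "in this sample")

def looks_overgeneralized(original: str, summary: str) -> bool:
    """Flag summaries that drop the qualifier and switch to a generic present-tense claim."""
    original_qualified = any(q in original.lower() for q in STUDY_QUALIFIERS)
    summary_generic = bool(GENERIC_PRESENT.search(summary)) and not any(
        q in summary.lower() for q in STUDY_QUALIFIERS
    )
    return original_qualified and summary_generic

# The article's own example pair:
print(looks_overgeneralized(
    "The treatment was effective in this study.",
    "The treatment is effective.",
))  # prints: True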

Strikingly, when the models were explicitly prompted to avoid inaccuracies, they were nearly twice as likely to produce overgeneralized conclusions as when given a simple summary request. “This effect is concerning,” Peters said. “Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they’ll get a more reliable summary. Our findings prove the opposite.”

Peters and Chin-Yee also directly compared chatbot-generated summaries with human-written summaries of the same articles. Unexpectedly, chatbots were nearly five times more likely to produce broad generalizations than their human counterparts. “Worryingly,” said Peters, “newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones.”

Why are these exaggerations happening? “Previous studies found that overgeneralizations are common in science writing, so it’s not surprising that models trained on these texts reproduce that pattern,” Chin-Yee noted. Additionally, since human users likely prefer LLM responses that sound helpful and widely applicable, the models may learn through these interactions to favor fluency and generality over precision, Peters suggested.

The researchers recommend using LLMs such as Claude, which had the highest generalization accuracy, setting chatbots to a lower ‘temperature’ (the parameter that controls a chatbot’s ‘creativity’), and using prompts that enforce indirect, past-tense reporting in science summaries. Finally, “If we want AI to support science literacy rather than undermine it,” Peters said, “we need more vigilance and testing of these systems in science communication contexts.”
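
For readers who access these models through an API, the first two recommendations translate directly into request settings. The sketch below is a minimal illustration, assuming the OpenAI Python SDK, an illustrative model name, and hypothetical prompt wording; it is not the prompt or configuration used in the study.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

abstract = "..."  # the abstract or article text to be summarized

response = client.chat.completions.create(
    model="gpt-4o",   # illustrative model name only
    temperature=0.2,  # lower temperature = less 'creative' rephrasing
    messages=[
        {
            "role": "system",
            "content": (
                "Summarize the study below. Report findings indirectly and "
                "in the past tense (e.g. 'the authors reported that the "
                "treatment was effective in this trial'), and do not state "
                "conclusions more broadly than the original text does."
            ),
        },
        {"role": "user", "content": abstract},
    ],
)
print(response.choices[0].message.content)

Keeping the temperature low and the reporting indirect does not guarantee a faithful summary, but it reflects the two levers the researchers point to: reduced ‘creativity’ and prompts that keep claims tied to the reported study.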


Source: Utrecht University

14.05.2025
