Ask the CEO of any AI startup, and you'll probably get an earful about the tech's potential to "transform work," or "revolutionize the way we access knowledge."
Really, there's no shortage of promises that AI is only getting smarter — which we're told will speed up the rate of scientific breakthroughs, streamline medical testing, and breed a new kind of scholarship.
But according to a new study published in the journal Royal Society Open Science, as many as 73 percent of seemingly reliable answers from AI chatbots could actually contain inaccurate, overly broad conclusions.
The collaborative research paper examined nearly 5,000 large language model (LLM) summaries of scientific studies generated by ten widely used chatbots, including ChatGPT-4o, ChatGPT-4.5, DeepSeek, and LLaMA 3.3 70B. It found that, even when explicitly prompted to stick to the facts, the AI-generated summaries omitted key details at five times the rate of human-written scientific summaries.
"When summarizing scientific texts, LLMs may omit details that limit the scope of research conclusions, leading to generalizations of results broader than warranted by the original study," the researchers wrote.
Alarmingly, the newer the chatbot, the higher its rate of error turned out to be, the exact opposite of what AI industry leaders have been promising us. Worse, an LLM's tendency to overgeneralize correlates with how widely it's used, "posing a significant risk of large-scale misinterpretations of research findings," according to the study's authors.
For example, use of the two ChatGPT models listed in the study doubled from 13 to 26 percent among US teens between 2023 and 2025. And while the older ChatGPT-4 Turbo was roughly 2.6 times more likely than the original texts to omit key details, the newer ChatGPT-4o models were nine times as likely. The same tendency showed up in Meta's LLaMA 3.3 70B, which was 36.4 times more likely to overgeneralize than older versions.
The job of synthesizing huge swaths of data into just a few sentences is a tricky one. Though it comes pretty easily to fully grown humans, it's a surprisingly complicated process to program into a chatbot.
While the human brain instinctively learns broad lessons from specific experiences (touching a hot stove, say), it's much harder for a chatbot to know which facts matter. A human quickly understands that stoves can burn while refrigerators do not, but unless told otherwise, an LLM might reason that all kitchen appliances get hot. Extend that analogy to the scientific literature, and things get complicated fast.
But summarizing is also time-consuming for humans; the researchers point to clinical medicine as one area where LLM summaries could have a huge impact on the workload. The stakes cut both ways, though: in clinical work, details matter enormously, and even the tiniest omission can compound into a life-changing disaster.
This makes it all the more troubling that LLMs are being shoehorned into every possible workspace, from high school homework to pharmacies to mechanical engineering — despite a growing body of work showing widespread accuracy problems inherent to AI.
The scientists did point out some important limitations to their findings, however. For one, the prompts fed to an LLM can have a significant impact on the answers it spits out. Whether this extends to LLM summaries of scientific papers is unknown, making it an avenue for future research.
Regardless, the trendlines are clear. Unless AI developers can set their new LLMs on the right path, you'll just have to keep relying on humble human bloggers to summarize scientific reports for you (wink).
More on AI: Senators Demand Safety Records from AI Chatbot Apps as Controversy Grows