Wordsmiths, these AIs are not.
Disease Control
AI models may be trained on the entire corpus of humanity's writing, but it turns out their vocabulary can be strikingly limited. A new yet-to-be-peer-reviewed study, spotted by Ars Technica, adds to the general understanding that large language models tend to overuse certain words that can give their origins away.
In a novel approach, these researchers took a cue from epidemiology by measuring "excess word usage" in biomedical papers in the same way doctors gauged COVID-19's impact through "excess deaths." The results are a fascinating insight into AI's impact in the world of academia, suggesting that at least 10 percent of abstracts in 2024 were "processed with LLMs."
"The effect of LLM usage on scientific writing is truly unprecedented and outshines even the drastic changes in vocabulary induced by the COVID-19 pandemic," the researchers wrote in the study.
The work may even provide a boost for methods of detecting AI writing, which have so far proved notoriously unreliable.
Style Over Substance
These findings come from a broad analysis of 14 million biomedical abstracts published between 2010 and 2024 that are available on PubMed. The researchers used papers published before 2023 as a baseline against which to compare papers that came out during the widespread commercialization of LLMs like ChatGPT.
They found that words that were once considered "less common," like "delves," are now used 25 times more often than they used to be, and others, like "showcasing" and "underscores," saw a similarly baffling ninefold increase. But some "common" words also saw a boost: "potential," "findings," and "crucial" went up in frequency by up to 4 percent.
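To make the "excess word usage" idea concrete, here is a minimal Python sketch of the arithmetic. It is an illustration under stated assumptions, not the researchers' actual procedure: it assumes a word's frequency is the share of abstracts that contain it, and that the expected frequency is simply the average of the baseline years. The function names and the toy abstracts are hypothetical.

```python
import re
from statistics import mean


def word_frequency(abstracts, word):
    """Share of abstracts that contain `word` at least once (case-insensitive)."""
    if not abstracts:
        return 0.0
    hits = 0
    for text in abstracts:
        # Tokenize on runs of letters so trailing punctuation does not hide a match.
        tokens = re.findall(r"[a-z]+", text.lower())
        if word in tokens:
            hits += 1
    return hits / len(abstracts)


def excess_usage(baseline_freqs, observed_freq):
    """Compare an observed frequency with the average of baseline frequencies.

    Assumption: the expected frequency is the plain mean of the baseline years.
    Returns (gap, ratio), where gap = observed - expected and
    ratio = observed / expected.
    """
    expected = mean(baseline_freqs)
    gap = observed_freq - expected
    ratio = observed_freq / expected if expected > 0 else float("inf")
    return gap, ratio


# Toy data, invented for illustration only.
baseline_year = [
    "This study delves into protein folding.",
    "We measure blood pressure in adults.",
    "A cohort analysis of sleep patterns.",
    "Outcomes after knee surgery.",
]
recent_year = [
    "This paper delves into tumor growth.",
    "Our review delves into gut bacteria.",
    "We measure heart rate in athletes.",
    "Outcomes after hip surgery.",
]

baseline = word_frequency(baseline_year, "delves")
observed = word_frequency(recent_year, "delves")
gap, ratio = excess_usage([baseline], observed)
print(f"baseline={baseline}, observed={observed}, gap={gap}, ratio={ratio}")
```

The ratio is the more telling number for rare words like "delves," while the gap matters more for common words like "potential," where even a large absolute rise is a small multiple of the baseline.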
Such a marked increase is basically unprecedented without the explanation of some pressing global circumstance. When the researchers looked for excess words between 2013 and 2023, the ones that came up were terms like "ebola," "coronavirus," and "lockdown."
Beyond their obvious ties to real-world events, these are all nouns, or as the researchers put it, "content" words. By contrast, the excess words in 2024 are almost entirely "style" words. Of the 280 excess "style" words that year, two-thirds were verbs and about a fifth were adjectives.
To see just how saturated AI language is with these tell-tales, have a look at this example from a real 2023 paper (emphasis the researchers'): "By meticulously delving into the intricate web connecting [...] and [...], this comprehensive chapter takes a deep dive into their involvement as significant risk factors for [...]."
Language Barriers
Using these excess style words as "markers" of ChatGPT usage, the researchers estimated that around 15 percent of papers published in non-English-speaking countries like China, South Korea, and Taiwan are now AI-processed, compared with 3 percent in countries where English is the native tongue, like the United Kingdom. LLMs, then, may be a genuinely helpful tool for non-native speakers to make it in a field dominated by English.
Still, the researchers admit that native speakers may simply be better at hiding their LLM usage. And of course, the appearance of these words is not a guarantee that the text was AI-generated.
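A rough Python sketch shows how marker words could be turned into an estimate like the ones above. It is a simplification under explicit assumptions, not the study's actual method: it takes the share of recent abstracts containing at least one marker word and subtracts the share seen in baseline abstracts, treating the difference as a floor on LLM-processed text. The marker set, function names, and toy abstracts are hypothetical.

```python
import re


def contains_marker(text, markers):
    """True if the text contains at least one marker word (case-insensitive)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not tokens.isdisjoint(markers)


def marker_share(abstracts, markers):
    """Share of abstracts that contain at least one marker word."""
    if not abstracts:
        return 0.0
    hits = sum(1 for text in abstracts if contains_marker(text, markers))
    return hits / len(abstracts)


def estimated_llm_share(baseline_abstracts, recent_abstracts, markers):
    """Excess share of marker-bearing abstracts over the baseline.

    Assumption: marker words that already appeared before LLMs were common
    are "natural" usage, so only the increase is attributed to LLMs.
    The result is clamped at zero.
    """
    gap = marker_share(recent_abstracts, markers) - marker_share(
        baseline_abstracts, markers
    )
    return max(gap, 0.0)


# Hypothetical marker set and toy abstracts, for illustration only.
markers = {"delves", "showcasing", "underscores"}
baseline_abstracts = [
    "This study delves into enzyme kinetics.",
    "We report a trial of a new vaccine.",
    "A survey of hospital admissions.",
    "Genetic variants in a rural cohort.",
]
recent_abstracts = [
    "This work delves into enzyme kinetics.",
    "Our data, showcasing robust effects, support the model.",
    "The result underscores the need for screening.",
    "A survey of hospital admissions.",
]

print(estimated_llm_share(baseline_abstracts, recent_abstracts, markers))
```

An estimate built this way can only undercount: an LLM-polished abstract that happens to avoid every marker word is invisible to it, and a human who simply likes the word "delves" is counted only if such usage rose above the baseline.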
Whether this will serve as a reliable detection method is up in the air, but what is certainly evident here is just how quickly AI can catalyze changes in written language.