"Things will always, provably, go wrong."

What happens when you feed AI-generated content back into an AI model? Put simply: absolute chaos.

A fascinating new study published in the journal Nature shows that AI models trained on AI-generated material will experience rapid "model collapse." Basically, as an AI model cannibalizes AI-generated data, its outputs become increasingly bizarre, garbled, and nonsensical, as if synthetic data — as opposed to high-quality, human-made material — breaks its brain.

On the one hand, the study's results serve as another reminder that AI models are incredibly responsive to their training data, and that allowing AI-generated material to seep into those datasets can have serious consequences for AI systems and the billion-dollar companies building them. At the same time, it underscores AI companies' ever-growing need for high-quality, human-made material with which to train their models: an increasingly scarce, and thus increasingly valuable, resource whose shortage could leave generative AI advancement at a plateau.

"The message is we have to be very careful about what ends up in our training data," study co-author Zakhar Shumaylov, an AI researcher at the University of Cambridge, told Nature, warning that otherwise "things will always, provably, go wrong."

Shumaylov's team started with a pre-trained large language model (LLM), which they then fine-tuned on a HuggingFace dataset made up of Wikipedia entries. The researchers then put the model through a string of generations, each time feeding the AI's output back into the training set.
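To get a rough feel for that loop, here's a toy version in Python. It is not the study's code or model: in place of an LLM fine-tuned on Wikipedia text, a simple bigram model is fitted to a small, invented seed passage, a new corpus is sampled from it, and that output becomes the next generation's training data.

```python
import random
from collections import defaultdict

# Toy stand-in for the study's setup (illustrative only): fit a bigram model
# to a seed passage, sample a "synthetic" corpus from it, then refit on that
# output and repeat, mirroring the loop the researchers ran with a real LLM.

def fit_bigrams(words):
    """Count word -> next-word transitions."""
    table = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        table[prev].append(nxt)
    return table

def sample(table, start, length=200):
    """Generate text by walking the bigram table."""
    out = [start]
    for _ in range(length - 1):
        successors = table.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return out

seed_corpus = (
    "the church tower in somerset was built in the perpendicular style "
    "and the village hall stands beside the old stone bridge near the mill"
).split()

corpus = seed_corpus
for generation in range(9):
    model = fit_bigrams(corpus)        # "train" on the current corpus
    corpus = sample(model, corpus[0])  # feed the model's output back in
    print(generation, len(set(corpus)), "distinct words left")
```

Because each sampled corpus covers only some of the word transitions, words that don't get visited vanish from every later generation, a crude analogue of the forgetting the researchers observed.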

The results were striking. For example, a prompt about buildings in Somerset, England, with text drawn from a niche Wikipedia page, first returned a relatively normal, though still error-ridden, response. But by the researchers' ninth iteration, the model's response was total gibberish about... jackrabbit tails.

"architecture," read the AI's garbled output. "In addition to being home to some of the world's largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @"

However bizarre the results, the process of model collapse is actually fairly simple. An AI system only has access to the data it's provided; more original, human-made data generally means a better-functioning generative AI system, as does diversity within that data. Conversely, feeding a model its own AI-spun generations limits that diversity. The model will compound its own errors, forget words and artifacts that appear less often in its training data, and eventually cave in on itself.
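One way to see that dynamic stripped of language entirely is to repeatedly refit a simple distribution to its own samples. The sketch below is a common illustration of the idea, not the study's code: a normal distribution is refitted, generation after generation, to a finite sample drawn from the previous fit, and the estimation errors compound until the tails are gone.

```python
import numpy as np

# Refit a normal distribution to a finite sample drawn from the previous fit.
# Each generation trains only on the last generation's output, so estimation
# errors compound and the distribution's tails get squeezed away.
rng = np.random.default_rng(0)

mean, std = 0.0, 1.0                 # "ground truth" learned from human data
for generation in range(200):
    samples = rng.normal(mean, std, size=50)   # this generation's model output
    mean, std = samples.mean(), samples.std()  # next model fits only that output

print(f"spread after 200 generations: {std:.4f}")  # collapses toward zero
```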

The study's authors aren't the first to note this phenomenon. AI researcher Jathan Sadowski last year dubbed the destructive process "Habsburg AI," wherein an AI model fed AI-made content essentially becomes an "inbred mutant," much like Europe's infamously inbred Habsburg dynasty, which married within itself into infertility and decline. Indeed, just as humans need genetic diversity in reproduction to avoid historically recessive jawlines, an AI model seems to need high-quality diversity in its training data to avoid collapse.

The study also raises another serious concern for data-desperate AI companies: the wavering sustainability of web scraping. AI models have largely been trained on data scraped from the open web and social media. Now, though, the internet is increasingly chock-full of AI-generated content. Thousands of AI-powered, spammy "news" sites are cropping up in Google; Facebook is quickly filling with bizarre AI imagery of soldiers and Jesus; and a growing number of established media companies have published AI-generated content on their websites. Very little of this material is marked as AI-generated, meaning that web scraping, should AI companies keep trying to gather their data from the digital wilds, is becoming an increasingly dubious means of collecting training data.

"The need to distinguish data generated by LLMs from other data raises questions about the provenance of content that is crawled from the Internet," the study authors write, adding that it's "unclear how content generated by LLMs can be tracked at scale."

If there's any silver lining for AI companies, it's that model collapse can be slowed down by infusing more original human data into a training set, according to the study. Still, the fact remains: AI models are hungry, and they need high-quality and original data. Can AI companies keep up with that demand?
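As a back-of-the-envelope illustration of that mitigation (again, not the paper's implementation), the refit loop sketched above behaves very differently when every generation's training mix keeps a slice of fixed, human-derived data alongside the model's own output:

```python
import numpy as np

# Same refit loop as before, but each generation trains on a blend of its own
# synthetic samples and a fixed pool of "human" data. The anchor of original
# data keeps the estimated spread from collapsing toward zero.
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, size=1000)      # fixed pool of human-made data

mean, std = 0.0, 1.0
for generation in range(200):
    synthetic = rng.normal(mean, std, size=50)
    mix = np.concatenate([synthetic, rng.choice(human, size=25)])  # ~1/3 human
    mean, std = mix.mean(), mix.std()

print(f"spread with human data mixed in: {std:.4f}")  # does not collapse toward zero
```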

More on AI training: AI Companies Running out of Training Data after Burning Through Entire Internet

