AI Loses Its Mind After Being Trained on AI-Generated Data

AI's kryptonite might just be... AI.

In a fascinating new paper, scientists at Rice University and Stanford University found that feeding AI-generated content to AI models seems to cause their output quality to erode. Train generative AI models (large language models and image generators both included) on enough AI-spun stuff, it seems, and this ouroboros-like self-consumption will break the model's digital brain.

Or, according to these scientists, it will drive the model "MAD."

"Seismic advances in generative AI algorithms for imagery, text, and other data types has led to the temptation to use synthetic data to train next-generation models," the researchers write. "Repeating this process creates an autophagous ('self-consuming') loop whose properties are poorly understood."

"Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease," they added. "We term this condition Model Autophagy Disorder (MAD)."

In other words, without "fresh real data" (translation: original human work, as opposed to stuff spit out by AI) to feed the beast, we can expect its outputs to suffer drastically. When a model is trained repeatedly on synthetic content, say the researchers, outlying, less-represented information at the fringes of its training data starts to disappear. The model then draws on increasingly convergent, less-varied data, and as a result, it soon starts to collapse in on itself.
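For a rough feel of how that narrowing can happen, here is a toy sketch in Python. It is not the paper's experiment: the "model" is just a bell curve fitted to a handful of numbers, and the function name `self_consuming_loop`, the `fresh_fraction` parameter, and the sample sizes are all assumptions made up for illustration. Each generation is fitted to samples drawn from the previous generation's fit, optionally mixed with some fresh "real" data.

```python
import random
import statistics


def self_consuming_loop(generations, sample_size, fresh_fraction, seed=0):
    """Return the fitted standard deviation after each generation.

    Toy illustration only (hypothetical setup, not the paper's method):
    "real" data is drawn from a standard normal distribution, and the
    "model" is a normal distribution fitted to whatever data it is given.
    """
    rng = random.Random(seed)

    def real_data(n):
        # Stand-in for fresh human-made data.
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Generation 0 is trained purely on real data.
    data = real_data(sample_size)
    mu, sigma = statistics.fmean(data), statistics.pstdev(data)
    history = [sigma]

    for _ in range(generations):
        n_fresh = int(fresh_fraction * sample_size)
        # Synthetic data: samples produced by the previous generation's model.
        synthetic = [rng.gauss(mu, sigma) for _ in range(sample_size - n_fresh)]
        data = synthetic + real_data(n_fresh)
        # "Retrain" the next generation on the mixed data set.
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)
        history.append(sigma)

    return history


if __name__ == "__main__":
    closed = self_consuming_loop(generations=300, sample_size=20, fresh_fraction=0.0)
    mixed = self_consuming_loop(generations=300, sample_size=20, fresh_fraction=0.5)
    print("no fresh data:   first spread", closed[0], "last spread", closed[-1])
    print("half fresh data: first spread", mixed[0], "last spread", mixed[-1])
```

In this toy setup, the spread of the fully self-fed loop tends to shrink toward zero as small estimation errors compound, which loosely mirrors the loss of diversity described above, while the loop that keeps receiving real data tends to hold its spread. Real generative models are far more complicated, so treat this as an intuition aid rather than evidence.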

The term MAD, as coined by the researchers, represents this self-swallowing process.

Take the results with a grain of salt, as the paper has yet to be peer-reviewed. But even so, the results are compelling. As detailed in the paper, the AI model tested made it through only five rounds of training with synthetic content before cracks in the outputs began to show.

Cool paper from my friends at Rice. They look at what happens when you train generative models on their own outputs…over and over again. Image models survive 5 iterations before weird stuff happens. https://t.co/JWPyRwhW8o
Credit: @SinaAlmd, @imtiazprio, @richbaraniuk pic.twitter.com/KPliZCABd4
— Tom Goldstein (@tomgoldsteincs) July 7, 2023

And if it is the case that AI does, in fact, break AI, there are real-world implications.

As the many active lawsuits against OpenAI make very clear, AI models have widely been trained by scraping troves of existing online data. It's also been generally true that the more data you feed a model, the better that model gets. As such, AI builders are always hungry for more training material — and in an age of an increasingly AI-filled web, that data scraping will get more and more precarious. And meanwhile, AI is being used by the masses and by major companies like Google to generate content, while the folks at Google and Microsoft have embedded AI into their search services as well.

That's the long way of saying that AI is already deeply intertwined with our internet's infrastructure. It's creating content, attempting to parse content, and swallowing content, too. And the more synthetic content there is on the internet, the harder it will likely be for AI companies to keep it out of their training datasets, potentially leaving the quality and structure of the open web hanging in the balance.

"Since the training datasets for generative AI models tend to be sourced from the Internet, today's AI models are unwittingly being trained on increasing amounts of AI-synthesized data," the researchers write in the paper, adding that the "popular LAION-5B dataset, which is used to train state-of-the-art text-to-image models like Stable Diffusion, contains synthetic images sampled from several earlier generations of generative models."

"Formerly human sources of text are now increasingly created by generative AI models, from user reviews to news websites, often with no indication that the text is synthesized," they add. "As the use of generative models continues to grow rapidly, this situation will only accelerate."

Concerning indeed. Fortunately, as Francisco Pires points out for Tom's Hardware, there could be ways to at least partly head off a future in which the whole internet goes MAD alongside AI models, particularly by adjusting model weights.

The results of the paper also raise the question of how useful these systems really are without human input. From the results shown here, the answer seems to be not very useful at all. And in a way, that feels a bit hopeful. See, machines can't replace us entirely — their brains will melt!

But then again, that might not be so hopeful after all. When AI takes over the world, maybe it won't kill humans; perhaps it'll just corral us into content farms, where we'll all be forced to write listicles about the "Star Wars" franchise and sacrifice our family recipes to Botatouille to keep the models running without collapsing.

