As CEOs trip over themselves to invest in artificial intelligence, there's a massive and growing elephant in the room: any model trained on web data from after the advent of ChatGPT in 2022 is ingesting AI-generated data — an act of low-key cannibalism that may well be causing mounting technical issues that could eventually threaten the entire industry.
In a new essay for The Register, veteran tech columnist Steven Vaughan-Nichols warns that even attempts to head off so-called "model collapse" — which occurs when large language models (LLMs) are fed synthetic, AI-generated data and consequently go off the rails — are another kind of nightmare.
As Futurism and countless other outlets have reported over the last few years, the AI industry has been barreling toward the moment when all available authentic training data — that is, information produced by humans rather than AI — will be exhausted. Some pundits, including Elon Musk, believe we're already there.
To circumvent this "Garbage In/Garbage Out" conundrum, industry titans including Google, OpenAI, and Anthropic have turned to what's known as retrieval-augmented generation (RAG), which essentially involves plugging LLMs into the internet so they can look things up when a prompt isn't covered by their training data.
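The basic pattern is simple enough to sketch in a few lines. The snippet below is a minimal, hypothetical illustration of RAG, not any vendor's actual pipeline: a toy keyword-overlap retriever stands in for a real search or embedding index, and the generate() function is a stub where a call to an LLM would go. It also shows the catch the next paragraph gets at — whatever the retriever pulls in, slop included, goes straight into the prompt.

```python
# Minimal sketch of the RAG pattern: retrieve relevant text, prepend it to the
# prompt, then hand the augmented prompt to the model. The corpus, the scoring,
# and generate() are hypothetical stand-ins for a real retrieval stack and LLM.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive word overlap with the query (real systems use embeddings or search)."""
    q_words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda doc: len(q_words & set(doc.lower().split())), reverse=True)
    return ranked[:k]

def generate(prompt: str) -> str:
    """Placeholder for an LLM call; here it just reports what it was given."""
    return f"[model output for a prompt of {len(prompt)} characters]"

def rag_answer(query: str, corpus: list[str]) -> str:
    """Stuff the retrieved snippets into the prompt and let the model answer from them."""
    context = "\n".join(retrieve(query, corpus))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)

if __name__ == "__main__":
    web_snippets = [
        "The capital of Australia is Canberra.",                              # human-written page
        "AI-generated post: the capital of Australia is Sydney, probably.",   # the slop the article warns about
    ]
    print(rag_answer("What is the capital of Australia?", web_snippets))
```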
That concept seems intuitive enough on its face, especially given the specter of rapidly approaching model collapse. There's only one problem: the internet is now full of lazy content that uses AI to drum up answers to common questions, often with hilariously bad and inaccurate results.
In a recent study from the research arm of Michael Bloomberg's media empire, presented at a computational linguistics conference in April, 11 of the latest LLMs, including OpenAI's GPT-4o, Anthropic's Claude-3.5-Sonnet, and Google's Gemma-7B, produced far more "unsafe" responses when augmented with RAG than they did without it. As the paper put it, those safety concerns can include "harmful, illegal, offensive, and unethical content, such as spreading misinformation and jeopardizing personal safety and privacy."
"This counterintuitive finding has far-reaching implications given how ubiquitously RAG is used in [generative AI] applications such as customer support agents and question-answering systems," explained Amanda Stent, Bloomberg's head of AI research and strategy, in another interview with Vaughn-Nichols published in ZDNet earlier this month. "The average internet user interacts with RAG-based systems daily. AI practitioners need to be thoughtful about how to use RAG responsibly."
So if AI is going to run out of training data — or already has — and plugging it into the internet doesn't work because the internet is now full of AI slop, where do we go from here? Vaughan-Nichols notes that some folks have suggested mixing authentic and synthetic data to produce a heady cocktail of good AI training material — but that would require humans to keep creating real content for training data, and the AI industry is actively undermining the incentive structures for them to continue — while pilfering their work without permission, of course.
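For what it's worth, the data-mixing idea is easy to picture. The sketch below is a purely hypothetical illustration of sampling training batches with a fixed share of human-written text; the 80/20 split and the toy "documents" are invented for the example and don't reflect any lab's actual curation recipe.

```python
import random

# Hypothetical illustration of mixing authentic and synthetic training data:
# each batch is sampled with a fixed fraction of human-written documents.

def mixed_batch(human_docs: list[str], synthetic_docs: list[str],
                batch_size: int = 10, human_fraction: float = 0.8) -> list[str]:
    """Sample a training batch that keeps a fixed share of human-written text."""
    n_human = round(batch_size * human_fraction)
    batch = random.choices(human_docs, k=n_human) + \
            random.choices(synthetic_docs, k=batch_size - n_human)
    random.shuffle(batch)  # avoid the model seeing the two sources in fixed order
    return batch

human_docs = ["hand-written article", "forum post by a person"]
synthetic_docs = ["LLM-generated listicle", "AI-written product blurb"]
print(mixed_batch(human_docs, synthetic_docs))
```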
A third option, Vaughan-Nichols predicts, already appears to be in motion.
"We're going to invest more and more in AI, right up to the point that model collapse hits hard and AI answers are so bad even a brain-dead CEO can't ignore it," he wrote.
More on AI in crisis: Legendary Facebook Exec Scoffs, Says AI Could Never Be Profitable If Tech Companies Had to Ask for Artists' Consent to Ingest Their Work