What Is Synthetic Data? Why AI Trained on AI Is the Next Big Thing (and Problem)

*Image: Getty / Futurism*

Short Supply

As AI companies start running out of training data, many are looking into so-called “synthetic data” — but it remains unclear whether such a thing will ever work.

As the New York Times explains, synthetic data is, on its face at least, a simple fix for the growing scarcity of AI training data and the other problems that come with it. If AI could be trained on data generated by AI, that would not only solve the training data shortage but could also eliminate the looming problem of AI copyright infringement.

But while companies like Anthropic, Google, and OpenAI are all working to create quality synthetic data, none have managed to do so quite yet.

Thus far, AI models built on synthetic data have tended to run into trouble. Australian AI researcher and podcaster Jathan Sadowski has dubbed the problem “Habsburg AI,” a reference to the deeply inbred Habsburg dynasty, whose ultra-prominent chins signaled the family’s penchant for intermarriage.

As Sadowski tweeted last February, the term describes “a system that is so heavily trained on the outputs of other generative AI’s that it becomes an inbred mutant, likely with exaggerated, grotesque features,” much like, well, the Habsburg jaw.

Last summer, Futurism interviewed another data researcher, Rice University’s Richard G. Baraniuk, about his term for this phenomenon: “Model Autophagy Disorder,” or “MAD” for short. It took only five generations of AI inbreeding for the model in the Rice research to “blow up,” as the professor put it.
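To make the idea of a self-consuming loop concrete, here is a toy Python sketch. It is not the Rice team's experiment: it assumes a one-dimensional Gaussian stands in for a "model," and it assumes each generation keeps only its most typical outputs (a hypothetical cherry-picking rule added purely for illustration). The function names and the five-generation default are likewise illustrative choices.

```python
import random
import statistics


def next_generation(samples, rng, n):
    """Fit a Gaussian to the current data, then sample a new dataset from it.

    Assumption for illustration: only "typical" outputs (within one standard
    deviation of the mean) are kept, mimicking a preference for safe samples.
    """
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    new_samples = []
    while len(new_samples) < n:
        x = rng.gauss(mu, sigma)
        if abs(x - mu) <= sigma:
            new_samples.append(x)
    return new_samples


def run_loop(generations=5, n=2000, seed=0):
    """Return the spread (standard deviation) of the data at each generation."""
    rng = random.Random(seed)
    samples = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for real data
    spreads = [statistics.stdev(samples)]
    for _ in range(generations):
        samples = next_generation(samples, rng, n)  # train only on model output
        spreads.append(statistics.stdev(samples))
    return spreads


if __name__ == "__main__":
    for generation, spread in enumerate(run_loop()):
        print(f"generation {generation}: spread = {spread:.3f}")
```

In this toy setup the spread of the data narrows with every generation, a crude analogue of a model losing the diversity of its original training data. Real generative models are far more complex, so this should be read as an intuition aid only.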

Synthetic Solutions

The big question: can AI companies figure out a way to make synthetic data that doesn’t drive their systems nuts?

As the NYT explains, OpenAI and Anthropic — which was, notably, founded by former OpenAI employees who wanted to create more ethical AI — are experimenting with a sort of checks-and-balances system. The first model generates the data, and the second checks the data for accuracy.
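The generate-then-check pattern can be sketched in a few lines of Python. This is a minimal illustration under heavy assumptions, not either company's actual pipeline: the "generator" here is a hypothetical function that writes addition problems and sometimes gets them wrong, and the "checker" is a hypothetical function that recomputes each answer.

```python
import random


def generate_examples(n, rng, error_rate=0.3):
    """Stand-in for the first model: emits addition problems, some answered wrongly."""
    examples = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = a + b
        if rng.random() < error_rate:
            answer += rng.choice([-2, -1, 1, 2])  # inject a mistake
        examples.append({"question": f"{a} + {b}", "answer": answer})
    return examples


def check_example(example):
    """Stand-in for the second model: independently recomputes the answer."""
    left, right = example["question"].split(" + ")
    return int(left) + int(right) == example["answer"]


def build_dataset(n, seed=0):
    """Generate candidate examples, then keep only those the checker accepts."""
    rng = random.Random(seed)
    candidates = generate_examples(n, rng)
    kept = [example for example in candidates if check_example(example)]
    return candidates, kept


if __name__ == "__main__":
    candidates, kept = build_dataset(1000)
    print(f"generated {len(candidates)} examples, kept {len(kept)} after checking")
```

The checker in this sketch is perfect because arithmetic is easy to verify. In practice the second model can make mistakes of its own, which is part of why the approach remains unproven.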

Thus far, Anthropic has been the most candid about its use of synthetic data, admitting that it uses a “constitution,” or list of guidelines, to train its two-model system, and even that Claude 3, the latest version of its LLM, was trained on “data we generate internally.”

While it’s a promising concept, the synthetic data research thus far is anything but promising. And given that researchers don’t fully understand how AI works to begin with, it’s difficult to imagine them figuring out synthetic data anytime soon.
