AI Developers Are Quietly Training AI Using AI-Generated Data

"Human-created data... is extremely expensive."

Self-Fulfilling

While most AI models are built on data made by humans, some companies are starting to use — or are trying to figure out how to use — data that was itself generated by AI. If they can pull it off, it could be a huge boon, albeit one that makes the entire AI ecosystem feel even more like a sort of algorithmic ouroboros.

As the Financial Times reports, companies including OpenAI, Microsoft, and the two-billion-dollar startup Cohere are increasingly investigating what's known as "synthetic data" to train their large language models (LLMs) for a number of reasons, not least of which is that it's apparently more cost-effective.

"Human-created data," Cohere CEO Aidan Gomez told the FT, "is extremely expensive."

Beyond the relative cheapness of synthetic data, however, there is the issue of scale. Training cutting-edge LLMs already consumes essentially all of the human-created data that's actually available, meaning that to build even stronger models, companies are almost certainly going to need more.

"If you could get all the data that you needed off the web, that would be fantastic," Gomez said. "In reality, the web is so noisy and messy that it’s not really representative of the data that you want. The web just doesn’t do everything we need."

It's All Happening

As the CEO noted, Cohere and other companies are already quietly using synthetic data to train their LLMs "even if it’s not broadcast widely," and others like OpenAI seem to expect to use it in the future.

During an event in May, OpenAI CEO Sam Altman quipped that he is "pretty confident that soon all data will be synthetic data," the report notes, and Microsoft has begun publishing studies about how synthetic data could beef up more rudimentary LLMs. There are even startups whose whole purpose is selling synthetic data to other companies.

There is a downside, of course: as critics point out, the integrity or reliability of AI-generated data could easily be called into question, given that even AIs trained on human-generated material are known to make major factual errors. And the process could generate some messy feedback loops. Researchers at Oxford and Cambridge call these potential problems "irreversible defects" in a recent paper, and it's not hard to see why.

Overall, the moonshot that companies like Cohere are working toward is self-teaching AIs that generate their own synthetic data.

"What you really want is models to be able to teach themselves," Gomez said. "You want them to be able to... ask their own questions, discover new truths and create their own knowledge. That’s the dream."

