AI Companies Are Running Out of Training Data

Data is the vital force of large AI models, and thus of the industry itself. But it's also a finite resource, and companies could run out.

Data plays a central role, if not the central role, in the AI economy. Data is a model’s vital force, both in basic function and in quality; the more natural — as in, human-made — data that an AI system has to train on, the better that system becomes.

Unfortunately for AI companies, though, it turns out that natural data is a finite resource — and if that tap runs dry, researchers warn they could be in for a serious reckoning.

As Rita Matulionyte, an information technology law professor at Australia’s Macquarie University, notes in an essay for The Conversation, AI researchers have been sounding the alarm about a dwindling data supply for nearly a year. One study last year by researchers at the AI forecasting organization Epoch AI estimated that AI companies could run out of high-quality textual training data as soon as 2026, while low-quality text and image data wells could run dry anytime between 2030 and 2060.

It’s a precarious situation for AI firms, given how much data AI systems need to operate and improve. AI models have advanced drastically as developers have poured in more and more data. If the data supply stagnates, so may the models — and thus, perhaps, the industry.

Matulionyte suggests that synthetic data, meaning data generated by AI models, could help data-hungry AI companies train new models, but that might not be a viable solution either. Indeed, using synthetic content might actually wreck a given model entirely; some research suggests that training AI models on AI-generated content causes a distinct inbreeding effect, with the lack of variance in the dataset resulting in garbled, uncanny outputs. (That said, as Matulionyte points out, some companies are already experimenting with synthetic training sets.)

As it stands, the most practical solution for this looming problem (save for the advent of mass human content farms, where we lowly carbon-based creatures click and clack away to feed the endless data thirst of our robot overlords) may actually be data partnerships. Basically, a company or institution with a vast and sought-after trove of high-quality data strikes a deal with an AI company to cough up that data, likely in exchange for cash.

“Modern AI technology learns skills and aspects of our world — of people, our motivations, interactions, and the way we communicate — by making sense of the data on which it’s trained,” reads a recent blog post from leading Silicon Valley AI firm OpenAI, which launched its new Data Partnerships program just last week. “Data Partnerships are intended to enable more organizations to help steer the future of AI,” the blog continues, “and benefit from models that are more useful to them, by including content they care about.”

Considering that most of the datasets currently used to train AI systems are built from internet-scraped data originally created by, well, all of us online, data partnerships may not be the worst way to go. But as data becomes increasingly valuable, it’ll certainly be interesting to see how many AI companies can actually compete for datasets, let alone how many institutions, or even individuals, will be willing to hand their data over to AI vacuums in the first place.

But even then, there’s no guarantee that the data wells won’t ever run dry. As infinite as the internet seems, few things are actually endless.
