AI Companies Running Out of Training Data After Burning Through Entire Internet

AI companies are swiftly running into a massive problem: there isn't enough data on the internet to train the next generation of models.

Mass Shortage

As AI companies keep building bigger and better models, they’re running into a shared problem: sometime soon, the internet won’t be big enough to provide all the data they need.

As the Wall Street Journal reports, some companies are looking for alternative sources of training data now that the internet is growing too small, with options including publicly available video transcripts and even AI-generated “synthetic data.”

While some companies, such as DatologyAI, which was formed by ex-Meta and Google DeepMind researcher Ari Morcos, are looking into ways to train larger and smarter models with less data and fewer resources, most big companies are looking into novel, and controversial, means of sourcing training data.

OpenAI, for instance, has discussed training GPT-5 on transcriptions of public YouTube videos, according to the WSJ’s sources, even as its own chief technology officer, Mira Murati, struggles to answer questions about whether its Sora video generator was trained using YouTube data.

Don’t Panic

Synthetic data, meanwhile, has been the subject of ample debate in recent months after researchers found last year that training an AI model on AI-generated data would be a digital form of “inbreeding” that would ultimately lead to “model collapse” or “Habsburg AI.”

Some companies, like OpenAI and Anthropic, which was founded by former OpenAI employees in 2021 in an effort to build safer and more ethical AI than that of their former employer, are seeking to head that off by creating supposedly higher-quality synthetic data, though of course, neither is letting the press in on the secret sauce of what exactly that would entail.

Indeed, Anthropic admitted when announcing its Claude 3 LLM that the model was trained on “data we generate internally,” and in an interview with the WSJ, the company's chief scientist, Jared Kaplan, said that he thinks there are good use cases for synthetic data as well.

While concerns about AI running out of data seem to have been spooking researchers for some time, researcher Pablo Villalobos told the newspaper that although his firm, Epoch, has estimated that AI will run out of usable training data within the next few years, there’s no reason for panic.

“The biggest uncertainty,” Villalobos said, “is what breakthroughs you’ll see.”

Then again, there is another obvious solution to this manufactured problem: AI companies could simply stop trying to create bigger and better models, given that aside from the training data shortage, they also use tons of electricity and expensive computing chips that require the mining of rare-earth minerals.
