Crisis Looms as AI Companies Rapidly Losing Access to Training Data

Over the past year, many content makers have put up restrictions on their content, preventing AI companies from scraping it for data. (*Image: Getty / Futurism*)

Data Crash

AI companies typically build their AI models on lots of publicly available content, from YouTube videos to newspaper articles. But many of these content hosts have now started to put up restrictions on their content.

Those new restrictions could bring about a “crisis” that would make these AI models less effective, according to a new study by the Massachusetts Institute of Technology’s Data Provenance Initiative.

The researchers audited 14,000 websites whose content is scraped for prominent AI training data sets. The striking result: about 28 percent “of the most actively maintained, critical sources” on the internet are now “fully restricted from use.”

The administrators of these websites have imposed the restrictions by placing increasingly stringent limits on how web crawler bots are allowed to scrape their content.
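For readers curious what such a limit looks like in practice, here is a minimal sketch. It assumes the restrictions are expressed in a site's robots.txt file, the long-standing convention for instructing crawlers, which this article does not name explicitly. The file contents, the `example.com` URLs, and the `is_allowed` helper are hypothetical illustrations, and `GPTBot` is used only as an example of a crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt (an assumption for illustration):
# it blocks one AI crawler entirely and keeps everyone else
# out of a single directory.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let this user agent fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/1"
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, article))
```

Rules like these are only a request. A crawler has to choose to check them, which is why the researchers qualify their warning with “if respected or enforced.”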

“If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems,” the researchers write.

No Free Lunch

It’s understandable that content hosts would put restrictions on their cache of now-valuable data.

AI companies have taken this publicly available material, much of it copyrighted, and are using it to make money without permission. This has understandably upset many, from The New York Times to celebrities like Sarah Silverman.

What’s particularly galling is that people like OpenAI CTO Mira Murati are saying that some creative jobs should disappear, even though it’s the content made by these creative people that powers models like OpenAI’s ChatGPT.

The arrogance on display, and the resulting blowback, have created a “consent in crisis,” as the study researchers call it. In other words, the once freewheeling internet with no walls is becoming a thing of the past, and AI models will be more biased, less diverse and less fresh.

Some companies are now hoping to work around these constraints by using synthetic data, which is essentially data generated by AI, but so far that’s been a poor substitute for original content produced by actual human beings.

Others, like OpenAI, have struck deals with media companies, but many have expressed alarm at these agreements — for good reason, because the goals of tech companies and media outfits are at odds.

Time will tell how the whole thing shakes out. One thing’s for sure, though: stockpiles of training data are becoming more valuable — and scarce — than ever.
