"A video data factory that can yield a human lifetime visual experience worth of training data per day."

Leaked documents obtained by 404 Media reveal that AI-powering chip giant Nvidia has been quietly scraping astronomical amounts of YouTube video data to train its AI models, a legally and ethically murky decision that adds to the ever-growing pile of deeply questionable, and often very secretive, AI training practices by entities ranging from startups to corporate giants.

According to 404's explosive scoop, Nvidia has obtained an eye-watering amount of YouTube data to train AI models including its Cosmos deep learning model, a self-driving car algorithm, a "digital human" AI avatar product, and its 3D world-building tool called Omniverse.

Nvidia also reportedly took pains to hide its activities from YouTube, using dozens of "virtual machines" that automatically changed their IP addresses to avoid detection.

Neither individual video creators nor YouTube owner Google, a notable Nvidia customer, consented to Nvidia's data scraping. And internal correspondence between Nvidia employees, including from its higher-ups, reveals a wildly brash, ask-questions-later — or ask-questions-hopefully-never — approach to the covert data-vacuuming campaign.

"We are finalizing the v1 data pipeline and securing the necessary computing resources," Ming-Yu Liu, Nvidia's VP of Research and a leader on the Cosmos project, wrote in a May email, according to 404, "to build a video data factory that can yield a human lifetime visual experience worth of training data per day."

What's more, in response to employee concerns regarding the legality and ethics of Nvidia's newfound data acquisition practices, managers including Liu insisted that the move was approved from the top down.

"This is an executive decision," Liu wrote to a hesitant underling on one such occasion, according to Slack messages reviewed by 404. "We have an umbrella approval for all of the data."

In one particularly egregious case, documents obtained by 404 revealed that Nvidia at one point knowingly trained its models on HD-VG-130M, a dataset of 130 million YouTube videos that was created explicitly for academic research. Given that Nvidia was using that academic data to train commercial models, it's a horrible look.

"I think there's a huge gap between commercializing something without someone's consent," Shayne Longpre, a PhD candidate at the MIT Media Lab, told 404 of the misuse of research-intended data, "versus studying the generative AI capabilities based off of things that have been publicly put online."

Nvidia has emerged as a central player in the AI industry due to its market dominance in graphics processing units (GPUs), the computing chips that often support compute-heavy AI systems. AI companies including OpenAI, Microsoft, Meta, and, again, Google count themselves as Nvidia customers, rendering Nvidia's sneaky use of what ultimately is Google-owned data all the more scandalous. Every major player in the AI industry is battling it out for dominance, including Nvidia, the market's hardware backbone and now a proven frenemy.

Indeed, when asked by 404 about Nvidia's scraping practices, a spokesperson for Google pointed to an April interview in which YouTube CEO Neal Mohan told Bloomberg that using YouTube's data without permission is in "clear violation" of the platform's terms of service.

"When a creator uploads their hard work to our platform, they have certain expectations," Mohan told Bloomberg. "One of those expectations is that the terms of service is going to be abided by. It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service."

In a statement to 404, Nvidia claimed that its AI training practices are "in full compliance with the letter and the spirit of copyright law." The jury's still out, of course, on how the humans who made the alleged lifetimes' worth of content now powering the chip maker's AI systems feel about that.


