"This is my livelihood, and I put time, resources, money, and staff time into creating this content."

Dirty Work

A giant dataset of YouTube subtitles has, per a new investigation, been used to train countless AI models without the permission of the tens of thousands of creators whose work was scraped.

As Wired reports with the help of the data-driven Proof News project, a dataset known as "YouTube Subtitles" has been used by everyone from Apple and Anthropic to Nvidia and Salesforce to train AI models since it was released in 2020.

Compiled by the open-source nonprofit EleutherAI, the YouTube Subtitles dataset doesn't include any actual video. Instead, it contains subtitle data from 173,536 videos gleaned from more than 48,000 channels, ranging from MIT and Harvard to MrBeast and the BBC.

Of all the channel owners that Proof managed to speak with for the story, none had been made aware ahead of time that EleutherAI had used subtitles from their videos.

Forgiveness, Not Permission

One of the impacted creators, the progressive vlogger David Pakman, was mighty peeved when he learned from Proof about his videos being included in the dataset.

"No one came to me and said, 'We would like to use this,'" the commentator, who had nearly 16o videos used in the dataset, told Wired. "This is my livelihood, and I put time, resources, money, and staff time into creating this content."

According to AI policy researcher Jai Vipra of Brazil's Fundação Getulio Vargas Law School, the YouTube Subtitles dataset is a "gold mine" because it can teach models how to replicate human speech.

To science vlogger Dave Farina of the popular "Professor Dave Explains" series, however, that gold mine comes at a cost to creators.

"It's still the sheer principle of it," Farina told Wired. "If you’re profiting off of work that I’ve done that will put me out of work or people like me out of work, then there needs to be a conversation on the table about compensation or some kind of regulation."

When Proof reached out to YouTube owner Google, EleutherAI, and the companies that had used the dataset, only a Google spokesperson responded publicly, saying that the company has taken "action over the years to prevent abusive, unauthorized scraping."

It's a provocative state of affairs — and it's hard to tell at this juncture how to fix it if companies won't even speak on the record about it.


