In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

Wondering what data OpenAI used to train its buzzy new text-to-video AI? OpenAI CTO Mira Murati seems to be wondering, too. — *Image: Wall Street Journal via YouTube / Futurism*

Mira Murati, OpenAI’s longtime chief technology officer, sat down with The Wall Street Journal’s Joanna Stern this week to discuss Sora, the company’s forthcoming video-generating AI. About halfway through the 10-minute-long interview, Stern straightforwardly asked Murati where the new model’s training data was gleaned from. But Murati, in the most cringe-inducing way possible, couldn’t find an answer beyond vague corporate language.

“We used publicly available data and licensed data,” Murati responded to the resoundingly simple question.

Stern pushed back with more specific source examples: “So, videos on YouTube?”

“I’m actually not sure about that,” said Murati, before rebuffing further queries about whether videos shared to Instagram or Facebook were fed into the model.

“You know, if they were publicly available — publicly available to use,” the CTO answered, “but I’m not sure. I’m not confident about it.”

Stern then asked about OpenAI’s training data partnership with the stock image company Shutterstock, and whether videos on the partnered platform were sucked into Sora’s training material. And this time? Murati decided to shut down the line of questioning altogether.

“I’m just not going to go into detail about the data that was used,” Murati continued. “But it was publicly available or licensed data.”

So, in sum, Murati can’t tell you exactly where the videos gobbled up by Sora first came from. But rest assured, the sourceless data was definitely, one hundred percent publicly available or licensed. Convincing stuff!

It’s a bad look all around for OpenAI, which has drawn wide controversy, not to mention multiple copyright lawsuits (including one from The New York Times), for its data-scraping practices. After all, if the company’s CTO can’t firmly tell you where its buzziest new model’s training data came from, it hardly suggests that OpenAI’s higher-ups take the issue seriously.

Me: What data was used to train Sora? YouTube videos?
OpenAI CTO: I'm actually not sure about that…

(I really do encourage you to watch the full @WSJ interview where Murati did answer a lot of the biggest questions about Sora. Full interview, ironically, on YouTube:… pic.twitter.com/51O8Wyt53c
— Joanna Stern (@JoannaStern) March 14, 2024

After the interview, Murati reportedly confirmed to the WSJ that Shutterstock videos were indeed included in Sora’s training set. But when you consider the vastness of video content across the web, any clips available to OpenAI through Shutterstock are likely only a small drop in the Sora training data pond.

Online, reactions to the clip were mixed, with many chalking Murati’s tight-lipped responses up to a possible lack of candor.

“So when *the CTO* of OpenAI is asked if Sora was trained on YouTube videos, she says ‘actually I’m not sure’ and refuses to discuss all further questions about the training data,” former LA Times tech columnist Brian Merchant wrote in an X-formerly-Twitter post. “Either a rather stunning level of ignorance of her own product, or a lie — pretty damning either way!”

“You’re the CTO ma’am,” added another netizen, “you should know.”

Others, meanwhile, jumped to Murati’s defense, arguing that if you’ve ever published anything to the internet, you should be perfectly fine with AI companies gobbling it up.

“Why does it matter? That is the question,” said one X user. “I find it insane that people make things public to everyone in the world and then complain when someone uses that public thing. If you want to be private, then be private.”

That latter argument, though, speaks to the bizarre new reality that internet users have now found themselves in. Historically, when someone told you to be careful of what you post online, the reasoning was something akin to “you might regret that later” — and not “a multibillion-dollar AI company might turn a profit by vacuuming that Facebook video of you and your family, or a goofy YouTube video you made with your friends, into a generative AI model.”

Whether Murati was keeping things close to the vest to avoid more copyright litigation or simply didn’t know the answer, people have good reason to wonder where AI training data is coming from, whether it’s “publicly available and licensed” or not. And moving forward, vague corporate mumbling probably isn’t going to cut it.
