OpenAI Whistleblower Disgusted That His Job Was to Vacuum Up Copyrighted Data to Train Its Models

A former OpenAI staffer is blowing the whistle on the company's AI training practices, alleging they violated copyright law.

Sounding the Alarm

A former OpenAI researcher is blowing the whistle on the company’s AI training practices, alleging that OpenAI violated copyright law to train its AI models — and arguing that OpenAI’s current business model stands to upend the business of the internet as we know it, according to The New York Times.

The ex-staffer, a 25-year-old named Suchir Balaji, worked at OpenAI for four years before deciding to leave the AI firm due to ethical concerns. As Balaji sees it, because ChatGPT and other OpenAI products have become so heavily commercialized, OpenAI’s practice of scraping online material en masse to feed its data-hungry AI models no longer satisfies the criteria of the fair use doctrine. OpenAI — which is currently facing several copyright lawsuits, including a high-profile case brought last year by the NYT — has argued the opposite.

“If you believe what I believe,” Balaji told the NYT, “you have to just leave the company.”

Balaji’s warnings, which he outlined in a post on his personal website yesterday, add to the ever-growing controversy around the AI industry’s collection and use of copyrighted material to train AI models, which was largely conducted without comprehensive government regulation and outside of the public eye.

“Given that AI is evolving so quickly,” intellectual property lawyer Bradley Hulbert told the NYT, “it is time for Congress to step in.”

Flipping the Switch

Balaji, who was hired in 2020, was one of several staffers tasked with collecting and organizing web-gathered training data that would eventually be fed into OpenAI’s large language models (LLMs). Because OpenAI was still technically just a well-funded research company at the time, the issue of copyright wasn’t as big of a deal.

“With a research project, you can, generally speaking, train on any data,” Balaji told the NYT. “That was the mindset at the time.”

But once ChatGPT was released in November 2022, Balaji says, his feelings started to change. After all, the chatbot was no longer a closed-door research project; instead, powered by OpenAI’s LLMs, it was being commodified for commercial use — including in cases where the AI was being used to produce content or services that directly reflected or mimicked the copyrighted source material it was trained on, thus threatening the livelihoods and profit models of those very individuals and businesses.

“This is not a sustainable model,” Balaji told the NYT, “for the internet ecosystem as a whole.”

For its part, in a statement to the NYT, OpenAI (which has since abandoned its non-profit roots entirely) argued that it builds its “AI models using publicly available data, in a manner protected by fair use and related principles” and that is “critical for US competitiveness.”
