A team of more than two dozen AI researchers from MIT, Cornell University, the University of Toronto, and other institutions has trained a large language model using only data that was openly licensed or in the public domain, the Washington Post reports, providing a blueprint for ethically developing the technology.

But, as the creators readily admit, it was far from easy.

As they describe in a yet-to-be-peer-reviewed paper published this week, it quickly became apparent that the bottleneck wouldn't be computing power, but human labor.

That's because the text in the more than eight terabytes of data they put together, which they're calling the Common Pile v0.1, had to be manually cleaned up and reformatted to make it suitable for AI training, WaPo explains. Then there was the enormous amount of extra legwork involved in double-checking the copyright status of all the data, since many online works are improperly licensed.

"This isn't a thing where you can just scale up the resources that you have available," like access to more computer chips and a fancy web scraper, study coauthor Stella Biderman, a computer scientist and executive director of the nonprofit Eleuther AI, told WaPo. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard."

Still, Biderman and her colleagues did get the job done.

Once the painstaking odyssey of creating the Common Pile was over, they used their guilt-free dataset to train a seven billion-parameter LLM. The result? An AI that admirably stacks up against industry models like Meta's Llama 1 and Llama 2 7B — which is impressive, but those were versions released over two years ago. That's practically a lifetime in the AI race.

Of course, this was accomplished by a more or less ragtag team rather than a corporation with billions of dollars in resources, so the researchers had to make up the difference with scrappiness. One particularly resourceful find was a set of over 130,000 English-language books in the Library of Congress that had been overlooked.

Copyright remains one of the biggest ethical and legal questions looming over AI. Leaders like OpenAI and Google burned through unfathomable amounts of data on the surface web to get to where they are, devouring everything from news articles to stuff as invasive as your social media posts. And Meta has been sued by authors who allege that it illegally used seven million copyrighted books that it pirated to train its AIs. 

The tech industry has justified its rapacious data demands by arguing that it all counts as fair use — and more existentially, that it would be "impossible" to develop this technology without vacuuming everyone's content up for free. 

This latest work is a rebuff to that Silicon Valley line, though it doesn't resolve every ethical concern. This is still a large language model, a technology fundamentally intended to destroy jobs, and perhaps not everyone whose work has ended up in the public domain would be happy to see it regurgitated by AI, assuming they're not long-dead artists whose copyrights have lapsed, of course.

Even if AI firms are reined in and made to use only works with permission or compensation (a big if), the fact remains that as long as these companies stick around, there will be significant pressure on copyright holders to allow AI training.

Biderman herself doesn't have any illusions that the likes of OpenAI will suddenly turn over a new leaf and start being paragons of ethical data sourcing. But she hopes her work will at least get them to stop hiding what they're using to train their AI models.

"Even partial transparency has a huge amount of social value and a moderate amount of scientific value," she told WaPo.

