A legal expert found that Meta's AI can reproduce lengthy passages of books verbatim — and if he's right, it could be seriously bad news for the company and its CEO Mark Zuckerberg.

First, a quick primer. All the AI that's commercially buzzy at the moment, like OpenAI's ChatGPT or Meta's Llama, is trained by feeding in huge amounts of data. Then researchers do a bunch of number crunching using algorithms, basically teaching the system to recognize patterns in all that data so thoroughly that it can then create new patterns — meaning that, say, if you ask for a summary of the plot of one of the "Harry Potter" books, it'll give you (hopefully) a reasonable overview.
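To make that concrete, here's a toy illustration of the idea (not Meta's actual training code, just a tiny pure-Python "bigram" model): it counts which word tends to follow which in a scrap of text, then generates new text from those learned patterns.

```python
# Toy illustration of next-word prediction: a bigram model "trained" on
# a tiny corpus. Real LLMs like Llama do something analogous at vastly
# greater scale, predicting the next token from the ones before it.
from collections import Counter, defaultdict
import random

corpus = "the boy who lived had a scar and the boy who lived was famous".split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, length=8, seed=0):
    """Sample a continuation by repeatedly picking a likely next word."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        options = follows.get(out[-1])
        if not options:
            break
        words, counts = zip(*options.items())
        out.append(rng.choices(words, weights=counts)[0])
    return " ".join(out)

print(generate("the"))  # prints a remix of the training text
```

Notice that with so little training data, the toy model can do little but parrot its source text back. At a vastly larger scale, that same failure mode is what the memorization findings below are about.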

The problem, Stanford tech law expert Mark Lemley explains in an interview with New Scientist, is that his team's research found that Meta's Llama can reproduce the contents of copyrighted books verbatim — such as, in one example he found, lengthy passages from the multibillion-dollar "Harry Potter" series.

For Meta, this is a gigantic legal liability. Why? Because if its AI is producing verbatim excerpts of the material used to train it, it starts to look less like the AI is producing transformative works based on general patterns about language and the world that it learned from its training data, and more like the AI is acting as a giant .ZIP file of copyrighted work, which users can then reproduce at will.

And it looks a lot like it is. When testing various AI models from companies including OpenAI, DeepSeek, and Microsoft, Lemley's team found that Meta's Llama was the only one that spat out book content word for word. Specifically, the researchers found that Llama seemed to have memorized material including the first book in J.K. Rowling's "Harry Potter" series, F. Scott Fitzgerald's "The Great Gatsby," and George Orwell's "1984."
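How do you test for that? One simple approach, sketched below, is to feed the model a snippet from a book and check whether it continues with the author's exact next words. This sketch assumes the Hugging Face transformers library; the model name and excerpt are placeholders, and the published research used a more rigorous probabilistic extraction method than this greedy check.

```python
# Simplified sketch of a verbatim-memorization probe. The model name
# and passage below are placeholders, not the study's actual setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-70B"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Take a known excerpt from the book under test (at least 100 tokens)
# and split it into a 50-token prompt plus the 50 tokens that actually
# follow it in print.
passage = "..."  # placeholder for the real excerpt
ids = tokenizer(passage, return_tensors="pt").input_ids
prefix, target = ids[:, :50], ids[:, 50:100]

# Greedy-decode a continuation and compare it to the book's real text;
# an exact match is strong evidence the passage was memorized.
output = model.generate(prefix, max_new_tokens=50, do_sample=False)
continuation = output[:, prefix.shape[1]:]
print("verbatim match:",
      tokenizer.decode(continuation[0]) == tokenizer.decode(target[0]))
```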

It's not under debate that Meta, like its peers in the tech industry, used copyrighted materials to train its AI. But its specific methodology for doing so has come under fire: in a copyright lawsuit brought against Meta by authors including the comedian Sarah Silverman, it emerged that the model was trained on the "Books3" dataset, which contains almost 200,000 copyrighted publications and which Meta engineers downloaded via an illegal torrent. ("Torrenting from a [Meta-owned] corporate laptop doesn't feel right," one of them fussed while doing so, in messages produced in court.)

Lemley and his team estimate that if just three percent of the Books3 dataset were found to be infringing, Meta could owe nearly $1 billion in statutory damages, and that's not counting any additional payouts based on profits gleaned from such theft. And if the proportion of infringing content turns out to be higher, at least in theory Meta could end up nailed to the wall.
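For a rough sense of where a figure like that comes from (our back-of-envelope numbers, not Lemley's exact methodology): Books3 holds almost 200,000 titles, and US copyright law allows statutory damages of up to $150,000 per willfully infringed work.

```python
# Back-of-envelope version of that estimate, using rough numbers
# rather than Lemley's exact methodology.
books_in_dataset = 195_000   # Books3 holds almost 200,000 titles
infringing_share = 0.03      # the "just three percent" scenario
damages_per_work = 150_000   # statutory maximum for willful infringement, USD

exposure = books_in_dataset * infringing_share * damages_per_work
print(f"${exposure:,.0f}")   # $877,500,000, i.e. nearly $1 billion
```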

Lemley is in a weird position, by the way. He previously defended Meta in that same lawsuit we mentioned above, but earlier this year, the Stanford professor announced in a LinkedIn post that he would no longer be representing the company, in protest of Meta and Zuckerberg's right-wing virtue signaling. Back then, he said he believed Meta should win its case — but based on his new research, it sounds like that opinion may have shifted.

Meta declined to comment to New Scientist about Lemley's findings.

More on Meta: Meta Says It's Okay to Feed Copyrighted Books Into Its AI Model Because They Have No "Economic Value"

