Meta Trained Llama Using Copyrighted Books. Mark Zuckerberg Knew It and Didn’t Care

Meta often promotes Llama as an open-source AI model. What it doesn’t disclose is the dataset it used for training. Legal proceedings have forced the company to reveal details about that process—and they raise serious questions.

Books used without permission. In the ongoing Kadrey v. Meta case, bestselling authors like Sarah Silverman and Ta-Nehisi Coates accuse Meta of training its AI models using copyrighted works. Unsealed documents suggest these claims are accurate.

Zuckerberg gave the green light. According to testimony, Meta CEO Mark Zuckerberg approved the use of a Library Genesis (LibGen) dataset to train Llama models, despite internal warnings. Some Meta employees reportedly cautioned that relying on LibGen could “undermine Meta’s negotiating position with regulators.”

What’s LibGen? Library Genesis describes itself as a “link aggregator.” In reality, it’s a vast online library offering access to copyrighted works from publishers such as McGraw Hill and Pearson Education. LibGen has faced multiple copyright infringement lawsuits and has been ordered to pay more than $30 million. Despite those rulings, the platform’s elusive operators have made it difficult for publishers to enforce judgments or recover funds.

Data-hungry practices. In April 2024, The New York Times detailed how tech companies are pursuing massive datasets to train their AI models. The report alleged that Meta hired workers in Africa to extract summaries of copyrighted books, arguing that “you can’t collect that data.” Ironically, Meta has accused OpenAI of similar practices, citing the difficulty of negotiating licenses with publishers, artists, and other copyright holders.



Documents revealed in a recent lawsuit show how Zuckerberg made a decision that could pose legal and ethical risks.

Other tech companies, including OpenAI, Google, and Perplexity, are navigating similar controversial practices.
