Meta often promotes Llama as an open-source AI model. What it doesn’t disclose is the dataset it used for training. Legal proceedings have now forced the company to reveal details about that process, and those details raise serious questions.
Books used without permission. In the ongoing Kadrey v. Meta case, bestselling authors including Sarah Silverman and Ta-Nehisi Coates accuse Meta of training its AI models on their copyrighted works without permission. Unsealed documents suggest these claims are accurate.
Zuckerberg gave the green light. According to testimony, Meta CEO Mark Zuckerberg approved the use of a Library Genesis (LibGen) dataset to train Llama models, despite internal warnings. Some Meta employees reportedly cautioned that relying on LibGen could “undermine Meta’s negotiating position with regulators.”
What’s LibGen? Library Genesis describes itself as a “link aggregator.” In reality, it’s a vast online library offering access to copyrighted works from publishers such as McGraw Hill and Pearson Education. LibGen has faced multiple lawsuits and over $30 million in fines for copyright infringement. Despite its legal troubles, the platform’s elusive operators have made it difficult for publishers to enforce judgments or recover funds.
Data-hungry practices. In April 2024, The New York Times detailed how tech companies are pursuing massive datasets to train their AI models. The report alleged that Meta hired workers in Africa to extract summaries of copyrighted books, arguing that “you can’t collect that data.” Ironically, Meta has accused OpenAI of similar practices, citing the difficulty of negotiating licenses with publishers, artists, and other copyright holders.
Erasing copyright markings. The plaintiffs’ attorney claims that Meta engineer Nikolay Bashlykov developed software to strip copyright information from e-books and scientific journal articles used to train Llama. According to the plaintiffs, the aim was to obscure the source of the data.
Facilitating distribution. Beyond using LibGen’s copyrighted works for training, Meta allegedly acted as a distribution node in LibGen’s torrent network, which would compound its alleged copyright infringement.
A complex case. While these allegations primarily involve earlier versions of Llama, the legal questions remain unresolved. In 2023, a court dismissed similar claims against Meta, with the company arguing “fair use” of the data. However, this defense may not succeed this time, as Judge Vince Chhabria has allowed critical documents to remain part of the case.
Meta isn’t alone. Although this lawsuit targets Meta, other tech companies face similar scrutiny. The New York Times has sued Microsoft and OpenAI, and Alden Global filed a lawsuit in April 2024 accusing OpenAI of copyright violations. However, OpenAI has recently struck licensing deals with publishers, including The Associated Press and Le Monde, to use their content legally.
Meanwhile, Google has declared its intent to use publicly available online content to train its models. Perplexity, too, has employed similar practices, raising questions about whether they’ve both used copyrighted works.
Image | Olena Bohovyk (Unsplash) | Meta