Anthropic used The Pile to train Claude. The Pile contained the Books3 dataset.
Books3 included copyrighted books.
Three authors sued the company for copyright infringement.
The debate over using copyrighted works to train generative artificial intelligence is far from resolved. It’s one of the major flashpoints in the discussion, because these models have learned to create similar content by training on authors’ work without compensating them. It’s a thorny issue, and several complaints about it have already been filed. This week, another was added to the list, this time against Anthropic, the company behind the Claude AI model.
What happened? According to the complaint, three authors sued Anthropic for “building a multibillion-dollar business by stealing hundreds of thousands of copyrighted books.” The plaintiffs are Andrea Bartz, journalist and author of We Were Never Here; Charles Graeber, author of The Good Nurse; and Kirk Wallace Johnson, author of The Feather Thief.
Books3. Anthropic used Books3 to train its LLM, Claude. Books3 is a dataset containing 196,640 books in a text format by authors such as Stephen King, Margaret Atwood, and Zadie Smith. In other words, it includes potentially copyrighted content. The key is what happened after its creation: It became part of The Pile.
The Pile? This is a vast open-source dataset of 825 GiB (gibibytes) of English text created by EleutherAI. Companies use it to train LLMs. It consists of several smaller datasets, including Books3 and YouTube Subtitles. An investigation by Proof News and Wired suggests that Nvidia, Salesforce, Anthropic, and Apple have used data from The Pile to train their models.
Anthropic confirmed earlier this month that it had used The Pile to train Claude. Books3 was removed from The Pile in August 2023, but the authors claim in the lawsuit that, although Books3 is no longer part of the “more official” version of The Pile, the original version is still available online.
In any case, the lawsuit alleges that, “It is apparent that Anthropic downloaded and reproduced copies of The Pile and Books3, knowing that these datasets were comprised of a trove of copyrighted content sourced from pirate websites like Bibliotik.” The authors want the court to order the company to pay damages and to force Anthropic to stop using copyrighted content.
It’s not the first time. And it probably won’t be the last. Since the advent of generative AI, copyright infringement lawsuits haven’t stopped. That explains why companies like OpenAI have taken a different approach: partnering. The company behind ChatGPT has partnered with the Associated Press, Axel Springer, Vox Media, and Condé Nast to use their content to train its AI.
However, it still has a thorn in its side: The New York Times, one of the world’s most influential media outlets, sued OpenAI and Microsoft in late 2023 over the use of its content. Alden Global Capital’s media outlets, which include The New York Daily News, The Chicago Tribune, and The Orlando Sentinel, also joined the suit. Alden is the second-largest newspaper publisher in the U.S.
In Anthropic’s case, this isn’t the first lawsuit it has faced. Last October, Universal Music Group, Concord Publishing, and ABKCO Music & Records sued the company for using “lyrics from numerous musical compositions” to train its AI. According to the lawsuit, Claude can generate identical or nearly identical lyrics to about 500 songs, including some by Beyoncé and the Rolling Stones.
This article was written by Jose García and originally published in Spanish on Xataka.
Image | Anthropic and Xataka On