A Group of Authors Has Sued Anthropic. The Reason: It Trained Its AI With Copies of Their Books

Anthropic used The Pile to train Claude. The Pile contained the Books3 dataset.
Books3 included copyrighted books.
Three authors sued the company for copyright infringement.

August 21, 2024, 16:47 ET

Karen Alfaro

The debate over using copyrighted works to train generative artificial intelligence is far from resolved. It’s one of the major flashpoints in the discussion, as these models have learned to create similar content from authors’ work without compensating them. It’s a thorny issue, and several complaints about it have already been filed. This week brought another, this time against Anthropic, the company behind the Claude AI model.

What happened? According to the complaint, three authors sued Anthropic for “building a multibillion-dollar business by stealing hundreds of thousands of copyrighted books.” The plaintiffs are Andrea Bartz, journalist and author of We Were Never Here; Charles Graeber, author of The Good Nurse; and Kirk Wallace Johnson, author of The Feather Thief.

Books3. Anthropic used Books3 to train its LLM, Claude. Books3 is a dataset containing 196,640 books in a text format by authors such as Stephen King, Margaret Atwood, and Zadie Smith. In other words, it includes potentially copyrighted content. The key is what happened after its creation: It became part of The Pile.

The Pile? This is a vast open-source dataset containing 825 GiB (gibibytes) of English text, created by EleutherAI. Companies use it to train LLMs. It consists of several smaller datasets, including Books3 and YouTube Subtitles. An investigation by Proof News and Wired suggests that Nvidia, Salesforce, Anthropic, and Apple have used these datasets to train their models.

Anthropic confirmed earlier this month that it had used The Pile to train Claude. Books3 was removed from The Pile in August 2023, but the authors point out in the lawsuit that although Books3 isn’t part of the “more official” version of The Pile, the original version is still available online.

In any case, the lawsuit alleges that, “It is apparent that Anthropic downloaded and reproduced copies of The Pile and Books3, knowing that these datasets were comprised of a trove of copyrighted content sourced from pirate websites like Bibliotik.” The authors want the court to order the company to pay damages and to force Anthropic to stop using copyrighted content.
