In a significant development for AI ethics and copyright law, Anthropic has reportedly removed millions of pirated books from the training data used to build its AI assistant Claude. According to Business Insider, the company took this step after discovering that a substantial portion of its training corpus included unauthorized digital copies of books. This move comes as AI companies face increasing scrutiny over their use of copyrighted materials without permission or compensation to creators.
The decision highlights the growing tension between AI development and intellectual property rights. Large language models require vast amounts of text to achieve their capabilities, but questions about the legality and ethics of training on copyrighted works without authorization have intensified. Anthropic's action appears to be proactive damage control as the AI industry faces multiple lawsuits from authors and publishers, including high-profile cases against OpenAI and other major players in the generative AI space.
This revelation could have far-reaching implications for how AI companies approach training data curation going forward. As legal challenges mount and regulatory attention increases, companies may need to develop more transparent practices around data sourcing and potentially establish compensation frameworks for content creators whose works contribute to AI development. Anthropic’s decision may signal a shift toward more cautious approaches to training data acquisition across the industry as companies attempt to balance innovation with respect for intellectual property rights.