AI Discussion: Copyright Dispute Analysis – Bartz v. Anthropic: Preliminary Review of Generative AI Copyright Infringement Claims
The Northern District of California's recent summary judgment order in Bartz v. Anthropic PBC marks a significant milestone in the evolving relationship between generative AI and copyright law. The decision, handed down on June 23, 2025, provides valuable insights into how federal courts may approach copyright infringement claims related to large language models (LLMs) like Anthropic's Claude [1][2].
The court found that using copyrighted books to train AI models such as Claude can qualify as fair use under 17 U.S.C. § 107, given the transformative nature of AI training [1][3]. However, the court drew a clear distinction between the use of lawfully acquired works and those obtained from pirated sources. The long-term storage and broad availability of pirated books within Anthropic's internal repository were deemed not to qualify as fair use [1][2][3].
Key takeaways from the ruling include:
- **Lawfully Acquired Works:** The court held that training LLMs on lawfully acquired copyrighted material is "spectacularly transformative" and therefore fair use [2][3].
- **Piracy and Format-Shifting:** The decision leaves open the possibility of liability if a company's training corpus includes pirated works or if works are digitized in a way that violates copyright [1][3].
- **No Binding Precedent:** While this decision is not binding on other federal courts, it is likely to be highly influential as one of the first judicial analyses directly on point for AI and copyright [1].
- **Output Liability Unaddressed:** The court did not decide whether outputs from trained LLMs could infringe copyright, as the plaintiffs did not allege it [2].
Practical implications for AI companies include:
- **Clearance and Sourcing Matter:** AI developers must ensure their training data is legally obtained, as the fair use defense may not protect those who source from piracy or unauthorized repositories [1][2].
- **Documentation and Compliance:** Companies should audit their data acquisition and retention practices to demonstrate lawful sourcing and avoid claims of infringement based on improper storage or digitization [4]; a minimal audit sketch follows this list.
- **Guidance for Future Litigation:** This case establishes an early, persuasive framework for how courts may evaluate similar claims against OpenAI, Perplexity, and other generative AI companies [1][4].
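To make the documentation point concrete, below is a minimal sketch, in Python, of the kind of provenance audit described above. The JSON-lines manifest format and the field names (`title`, `acquisition`, `license_ref`) are illustrative assumptions, not an established standard and not Anthropic's actual practice.

```python
"""Minimal sketch of a training-corpus provenance audit.

Assumes a hypothetical JSON-lines manifest in which each record
describes one work in the corpus; the field names (title,
acquisition, license_ref) are illustrative, not a real standard.
"""
import json
from pathlib import Path

# Acquisition methods treated as lawful sourcing for this sketch.
APPROVED_ACQUISITIONS = {"purchased", "licensed", "public_domain", "authorized_dataset"}

def audit_manifest(manifest_path: str) -> list[dict]:
    """Return manifest records lacking documented, lawful provenance."""
    flagged = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        acquisition = record.get("acquisition")
        if acquisition not in APPROVED_ACQUISITIONS:
            # No acquisition record, or a source outside the allowlist.
            flagged.append(record)
        elif acquisition == "licensed" and not record.get("license_ref"):
            # Licensed works should reference the license or invoice on file.
            flagged.append(record)
    return flagged

if __name__ == "__main__":
    for record in audit_manifest("corpus_manifest.jsonl"):
        print(f"REVIEW: {record.get('title', '<unknown>')} "
              f"(acquisition={record.get('acquisition')})")
```

The allowlist keeps the audit conservative: anything undocumented or outside the approved set is flagged for human review rather than silently accepted.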
Unresolved issues include:
- **Pirated Works and Fair Use:** The court explicitly declined to rule on whether training on pirated works could ever qualify as fair use, leaving this question open for future cases [2].
- **Output Infringement:** No ruling was made on whether outputs from trained LLMs could infringe copyrights, an area likely to be the subject of future litigation [2].
- **Jurisdictional Differences:** The ruling is limited to the Northern District of California and is not binding elsewhere, so outcomes may vary in other circuits [1].
In conclusion, the Bartz v. Anthropic decision offers a tentative roadmap for copyright litigation involving generative AI: training on copyrighted works is likely fair use if the works are lawfully acquired, but sourcing from piracy or unauthorized copies remains risky [1][2][3]. AI companies must be vigilant about data provenance, as the legality of their operations may hinge on the methods used to acquire and manage training data. This case does not, however, resolve all potential copyright issues, especially regarding AI-generated outputs and the use of pirated content in training sets [2][3].
The decision has practical consequences for companies building or using training datasets, underscoring the importance of proactive copyright risk management. The case was brought by authors whose books were used to train Anthropic's popular Claude LLMs. Judge Alsup emphasized that the pirated books were kept long-term, indexed, and made broadly available within the company, and on that basis the Court denied Anthropic's motion for summary judgment with respect to its practice of acquiring and retaining a large, centralized internal library of pirated books. The Court also distinguished between the training process and specific outputs, making clear that its decision did not address whether Claude might, in specific instances, output infringing content.
On the merits, the Court ruled that Anthropic's use of books to train its LLMs qualified as fair use because it was transformative, and it rejected the plaintiffs' argument that training Claude on their books displaced, or will displace, a licensing market for copyrighted materials used in AI training. The Court further found that purchasing and scanning millions of print copies of books, then destroying the print copies, could constitute fair use. On the compliance side, companies should consider auditing the sources of their training corpora, removing unauthorized works, documenting material provenance (one ingestion-time approach is sketched below), and seeking licenses or relying on compliant data providers. Vendor contracts with GenAI providers should be reviewed to ensure that datasets supplied for AI training do not contain unauthorized copyrighted content and that indemnities cover potential copyright exposure. The decision signals that copyright compliance in the AI context turns not only on how works are used, but also on how they are acquired and handled internally.
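As a companion to the audit sketch above, here is a hedged illustration of recording provenance at ingestion time, so that every work entering a corpus carries an acquisition record from the start. The function name and manifest fields are hypothetical, carried over from the earlier sketch.

```python
"""Companion sketch: record provenance when a work enters the corpus.

The manifest format and field names are assumptions carried over from
the audit sketch above, not an established standard or any company's
actual pipeline.
"""
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

def ingest_work(corpus_dir: str, manifest_path: str, source_file: str,
                title: str, acquisition: str, license_ref: str | None = None) -> dict:
    """Copy a work into the corpus and append a provenance record."""
    data = Path(source_file).read_bytes()
    # A content hash ties the manifest entry to the exact bytes ingested.
    digest = hashlib.sha256(data).hexdigest()
    corpus = Path(corpus_dir)
    corpus.mkdir(parents=True, exist_ok=True)
    (corpus / f"{digest}.txt").write_bytes(data)
    record = {
        "title": title,
        "sha256": digest,
        "acquisition": acquisition,   # e.g. "purchased", "licensed"
        "license_ref": license_ref,   # invoice or license ID, if any
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(manifest_path, "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(record) + "\n")
    return record
```

Writing the record at ingestion time, keyed to a content hash, is what makes the later audit meaningful: there is no gap between what is in the corpus and what the manifest claims about it.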
- In light of the decision, it is advisable for AI companies to meticulously verify the legality of their training data sources, as the fair use defense may not be sufficient for those sourcing from piracy or unauthorized repositories.
- The Bartz v. Anthropic case highlights the importance of internal data handling practices in AI copyright litigation, as the long-term storage and broad availability of pirated books within a company's repository were deemed not to qualify as fair use.