The launch of ChatGPT and other deep learning tools quickly led to a flurry of lawsuits against model developers. Legal theories vary, but most are rooted in copyright: plaintiffs argue that use of their works to train the models was infringement; developers counter that their training is fair use. Meanwhile, developers are making as many licensing deals as possible to stave off future litigation, and it's a sound bet that the existing litigation is an elaborate scramble for leverage in settlement negotiations.

These cases can end one of three ways: rightsholders win, everybody settles, or developers win. As we've noted before, we think the developers have the better argument. But that's not the only reason they should win these cases: while creators have a legitimate gripe, expanding copyright won't protect jobs from automation. A win for rightsholders, or even a settlement, could also lead to significant harm, especially if it undermines fair use protections for research or artistic freedom for creators. In this post and a follow-up, we'll explain why.

State of Play

First, we need some context, so here’s the state of play:

DMCA Claims

Multiple courts have dismissed claims under Section 1202(b) of the Digital Millennium Copyright Act, stemming from allegations that developers removed or altered attribution information during the training process. In Raw Story Media v. OpenAI, Inc., the Southern District of New York dismissed these claims because the plaintiffs had not "plausibly alleged" that training ChatGPT on their works had actually harmed them, and there was no "substantial risk" that ChatGPT would output their news articles.

Courts granted motions to dismiss similar DMCA claims in Andersen v. Stability AI, Ltd., The Intercept Media, Inc. v. OpenAI, Inc., Kadrey v. Meta Platforms, Inc., and Tremblay v. OpenAI.

Another such case, Doe v. GitHub, Inc., will soon be argued in the Ninth Circuit.

Copyright Infringement Claims

Rightsholders also assert ordinary copyright infringement, and the initial holdings are a mixed bag. In Kadrey v. Meta Platforms, Inc., the court dismissed “nonsensical” claims that Meta’s LLaMA models are themselves infringing derivative works.

In Andersen v. Stability AI Ltd., however, the court held that copyright claims based on the allegation that the plaintiffs' works were included in a training dataset could go forward, where use of the plaintiffs' names as prompts generated images "similar to plaintiffs' artistic works." The court also held that the plaintiffs plausibly alleged the model was designed to "promote infringement."

It's early in the case—the court was merely deciding whether the plaintiffs had alleged enough to justify further proceedings—but it's a dangerous precedent. Copyright protection extends only to an author's actual expression; the underlying facts and ideas are not protected. That means that while a model cannot output an identical copy of a protected work without infringing, it is free to generate stylistically "similar" images. Training alone is insufficient to give rise to an infringement claim, and the court impermissibly conflated similar outputs with copying of protectable expression.

Fair Use

In most of the AI cases, courts have yet to consider—let alone decide—whether fair use applies. In one unusual case, however, the judge flip-flopped. Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence, Inc. concerns legal research technology: Thomson Reuters prepares annotations describing legal opinions, and Ross hired lawyers to rewrite those annotations, then used the rewritten versions to train its own legal search tool.

Originally, the court got it right, holding that using copyrighted works “as a step in the process of trying to develop a ‘wholly new,’ albeit competing, product” was transformative intermediate copying—i.e., fair use.

After reconsidering, however, the judge changed his mind, essentially disagreeing with prior case law regarding search engines. We think an appeals court is unlikely to uphold this divergence. If it did, it could pose legal problems for AI developers—and anyone building search tools.

Copyright law favors new technologies that help people learn and locate information, even when developing the tool required copying content in order to index it. Ross's tool provided links to legal opinions, not Thomson Reuters content, and it dealt in non-copyrightable legal holdings, not creative annotations.

Thomson Reuters has often pushed copyright’s limits—such as claiming proprietary rights to legal opinion page numbers. Sadly, the judge in this case enabled them to do so again. We hope the appeals court reverses.

The Side Deals

While all of this is going on, developers like OpenAI and Google have made multimillion-dollar licensing deals with Reddit, the Wall Street Journal, and other copyright owners. There’s now a $2.5 billion licensing market for training data—even though using that data is almost certainly fair use.

What’s Missing

This litigation is getting plenty of attention—and it should, because the stakes are high. But the real stakes are getting lost. These cases aren't just about who profits from generative AI; the outcomes will decide whether anyone other than companies with deep pockets can shape the future of AI.

More on that tomorrow.

This post is part of our AI and Copyright series. Check out our other posts in this series.