How and when can copyrighted materials be used to train AI large language models? In the US, the federal court system is hearing about 30 cases that will, starting next year, begin to build answers, with cases in California, New York and Delaware all contending to produce that watershed first "fair use" determination. A smaller group of privacy cases is pursuing claims that personal information was illegally used to train LLMs, but it's unclear whether those claims are strong enough to move forward.
Not long after the dawn of the commercial Internet, a small, newly minted company called Google triggered legal battles over whether US copyright law prevented two key innovations. One was a dispute over whether Google’s display of “thumbnail” versions of images in search results infringed the copyright in the original images. In the other, Google launched a project to scan the world’s books to transmute words on paper into a searchable database, “a card catalog for the digital age."
Each disputed use was ultimately ruled by US courts to be a transformative fair use, though the Google Books case took a decade, ending only in 2015 (see here).
A generation after those skirmishes began, the US court system is in the first stages of what is likely to be an extended battle, one that could ultimately reach the Supreme Court, over the use of massive volumes of data, including copyrighted works such as books and music as well as personal data, to train generative artificial intelligence large language models.
— Courtroom questions —
In roughly 30 lawsuits playing out in federal courts in the San Francisco Bay Area, New York City and a few other places, US judges will next year start issuing rulings on whether copyrighted works can be used to train AI models, or whether AI companies will have to license copyright-protected data.
A number of those lawsuits also confront privacy questions about the use of personal information to train AI, though the privacy cases are far outnumbered by the copyright suits. Nor do the roughly half-dozen privacy suits, at least this early in the litigation process, appear to have the legs of the copyright cases: a district court judge in San Francisco recently threw out a privacy suit against OpenAI and Microsoft.
Judge Vince Chhabria acidly dismissed an experienced privacy plaintiffs’ firm’s arguments about how political leaders and European governments have reacted to recent advancements in AI technology as “irrelevant information” that didn’t qualify as legal argument.
“The development of AI technology may well give rise to grave concerns for society, but the plaintiffs need to understand that they are in a court of law, not a town hall meeting,” Chhabria wrote in the order (see here). The firm, Morgan & Morgan, elected not even to amend its complaint, allowing the case to expire in late June.
Chhabria could be the US judge who issues the historic first ruling on whether the use of copyrighted works to train an AI model is a transformative fair use, in a suit against Meta Platforms brought by a group of writers and performers including comedian Sarah Silverman and author Ta-Nehisi Coates.
Chhabria said during a recent hearing that if the case progresses steadily, he could issue a summary judgment ruling on fair use next March or April. However, Chhabria has forced the reconstitution of the plaintiffs’ legal team, meaning that timing could be in question.
“This is not your typical proposed class action,” Chhabria said in a hearing in September on the question of whether the use of copyrighted works to train a generative AI system is a transformative fair use (see here). “This is an important case. It's an important societal issue. It's important for your clients, important for the proposed class members. It's important for society.”
Other AI litigation being closely watched by legal scholars includes a suit brought by The New York Times against OpenAI and Microsoft over the use of the newspaper's journalism to train ChatGPT, which is progressing slowly in federal court in New York amid a scorched-earth battle over evidence discovery.
Another case closely watched by legal scholars and the tech industry is Thomson Reuters v. Ross Intelligence in federal court in Delaware, a case filed in 2020 in which the legal publisher, owner of Westlaw, sued Ross for trying to develop an AI legal chatbot based on its scraping of Westlaw’s notes about court filings.
Until the judge indefinitely postponed a trial this summer, the Ross case had been poised to produce the first US ruling on whether the use of copyrighted works to train AI qualifies as fair use.
The first fair use decision, whether it comes next year or sooner, will be a stake in the ground around which other copyright decisions will be oriented. “All of the cases that the court renders a decision, the first few that start rolling out, I think, will be looked at by other judges” around the US, said Edward Lee, a law professor at Santa Clara University in Silicon Valley who publishes the blog ChatGPTiseatingtheworld.com. “Judges will be concerned about not coming out with inconsistent analyses of fair use.”
That doesn’t mean everyone will agree, Lee and other experts say. The facts, the AI technologies and the companies involved in the use of copyrighted or personal data will vary from case to case. And the district courts in California, New York, Massachusetts and Delaware that are hearing the AI cases sit in four different appellate circuits, so appeals would go to four different circuit courts. That raises the prospect, perhaps five years from now, of a circuit split that the Supreme Court would have to resolve.
If the fair use questions around AI take as long to resolve as they did for a revolutionary technology such as Google Books, and there is no reason to think they won't, it could be the early 2030s before the Supreme Court reviews appeals court rulings and the US rules are settled for good.
“I think there is a fair possibility that [the Supreme Court] will get involved, because the issue is of tremendous importance to the United States, and if there are different approaches among the circuit courts, like a split on the fair use question, that’s something the Supreme Court usually gets involved with,” Lee said.
“Even if there is not a circuit split, this is the kind of issue the Supreme Court might find it wanted to get involved in. I think everybody knows AI is a transformative technology.”
— AI is different —
These early US court rulings carry huge ramifications for content creators and AI companies alike, tech giants included. The licensing of copyrighted data for AI training is already a significant source of revenue for some companies and creators.
Reddit, for example, said in its IPO securities filing this year that it has already inked licensing deals valued at $203 million with OpenAI, Google and other AI companies, and that it believes licensing data for AI training could be a $1 trillion market by 2027. “We believe the importance of data to all types of analytics and AI, from training to testing and refining models, positions us well to tap into this strong market,” Reddit said.
A ruling that the use of copyrighted works for AI training is not fair use would strengthen the hands of licensors like Reddit, but Meta Platforms' chief executive, Mark Zuckerberg, says content creators tend to “overestimate” the value of their content “in the grand scheme” of the emerging AI industry. In what sounded like the opening of a tough business negotiation, Zuckerberg suggested content creators may need the AI companies more than the AI companies need them.
“We pay for content when it’s valuable to people. We’re just not going to pay for content when it’s not valuable to people. I think that you’ll probably see a similar dynamic with AI, which my guess is that there are going to be certain partnerships that get made when content is really important and valuable,” Zuckerberg said in an interview with The Verge in late September.
But if those deals don’t get made and creators’ content isn’t used to train the algorithms, Zuckerberg said Meta has no problem leaving them out. “It’s not like that’s going to change the outcome of this stuff that much,” he said.
One early AI decision already getting attention came in August from another district court judge in San Francisco, William Orrick, in a copyright case against Stability AI and three other image-generating AI companies: Midjourney, DeviantArt and Runway AI (see here). Orrick largely denied the companies' motions to dismiss the copyright claims, although he substantially narrowed the claims against the image-generating companies by granting their motions to dismiss claims under the Digital Millennium Copyright Act.
One important element of Orrick’s decision (see here), said Kevin Madigan of the Copyright Alliance, is that the judge made clear that AI is an entirely different animal from the technologies at the center of past fair use disputes, such as the video-cassette recorder in the 1980s and the Grokster file-sharing service in the early 2000s.
“Those technologies were not built based on the ingestion of massive amounts of copyrighted works,” Madigan said.
Fair use litigation over both the VCR and Grokster reached the Supreme Court, in 1984 and 2005 respectively. In the first case, Universal City Studios sued Sony, alleging contributory copyright infringement because consumers were using Sony’s VCRs to record Universal movies shown on free ad-supported television. Universal sought to block Sony from making VCRs, as well as to win financial damages, but the Supreme Court established the “Sony safe-harbor” principle that a technology capable of substantial non-infringing uses isn’t, by design, infringing copyright.
Orrick said the case against Stability AI over its Stable Diffusion model is unlike that earlier contributory-infringement case, in which Universal had no evidence that Sony intended to induce copyright infringement by manufacturing VCRs.
“Instead, this is a case where plaintiffs allege that Stable Diffusion is built to a significant extent on copyrighted works and that the way the product operates necessarily invokes copies or protected elements of those works,” Orrick wrote, referring to Stability AI allegedly scraping 5 billion images, including the plaintiffs’ copyrighted artwork, to train the model. “The plausible inferences at this juncture are that Stable Diffusion by operation by end users creates copyright infringement and was created to facilitate that infringement by design.”
Whether Stable Diffusion, unlike the VCR, is infringing by design is a question that doesn't need to be answered at this early stage of the case, Orrick decided. “Whether true and whether the result of a glitch (as Stability contends) or by design (plaintiffs’ contention) will be tested at a later date,” Orrick wrote.
Please email editors@mlex.com to contact the editorial staff regarding this story, or to submit the names of lawyers and advisers.