NYT v. OpenAI: The Copyright Suit Putting AI Training's Fair-Use Defense on Trial
When The New York Times filed its copyright complaint in federal court on December 27, 2023, the paper's lawyers opened with a claim that cut to the heart of the modern AI industry: that OpenAI and Microsoft had copied millions of Times articles—painstakingly reported, edited, and fact-checked over 172 years of publishing—to train the GPT family of large language models, and that ChatGPT could reproduce those articles nearly word-for-word on demand, threatening the subscription revenue the Times depends on to fund its journalism.
The case, formally captioned The New York Times Company v. Microsoft Corporation et al., S.D.N.Y. No. 1:23-cv-11195, is the most consequential copyright lawsuit of the AI era—not only because of the plaintiff, but because the doctrinal question it poses has no established answer: is feeding billions of copyrighted words into a neural network a "transformative use" protected by the fair-use doctrine, or an act of infringement at industrial scale?
The Case at a Glance
The 69-page complaint, filed December 27, 2023, names OpenAI LP, OpenAI Inc., several subsidiary entities, and Microsoft Corporation as defendants. The core allegation is two-fold. First, that OpenAI unlawfully copied Times articles into training datasets for its GPT-2, GPT-3, GPT-3.5, and GPT-4 models—acts of reproduction on OpenAI servers before a user ran a single query. Second, that when ChatGPT and Microsoft Copilot generate responses, they sometimes reproduce Times content verbatim or near-verbatim in ways that substitute for visiting the original, paywalled article.
The complaint's most striking exhibit, Exhibit J, presented 100 side-by-side comparisons: a GPT-4 output column matched against the original Times article, with word-for-word matches running to hundreds of words. One example involved a Pulitzer Prize–winning investigative series on predatory lending in New York City's taxi industry, reproduced to a degree the Times argued made the AI output a functional substitute for the paywalled original. These examples became the factual anchor of the Times' market-harm argument.
The Two Theories of Infringement
Copyright lawyers watching the case have identified two legally distinct theories the Times is pressing simultaneously, neither of which courts have resolved at LLM scale.
- The ingestion theory. Every time OpenAI copied an article into a training corpus, it created an unauthorized reproduction of a copyrighted work under 17 U.S.C. § 106(1). That the copy was later "consumed" during training does not eliminate the infringement. On this theory, OpenAI's fair-use defense must succeed for the training process itself, not just for model outputs.
- The output theory. Even if training qualifies as fair use, when ChatGPT reproduces a Times article verbatim in response to a user prompt, that output is itself an infringement—a reproduction and public display of the copyrighted work without a license, delivered to a user who might otherwise have paid for access.
OpenAI contests both. For the ingestion theory, it argues that training is transformative under Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015), which held that Google's full-text scanning of books for its search index was a fair use because it added informational value without displaying full text to end users. For the output theory, OpenAI contends that verbatim reproduction is a rare "bug" the company actively suppresses, not an intended product feature, and that the Times engineered Exhibit J through adversarial prompt manipulation designed to surface edge-case failures.
The New Doctrine Being Tested
Authors Guild v. Google is the spine of OpenAI's defense, but its application to large language models is legally untested. The Second Circuit emphasized in 2015 that Google's index never displayed full text to users—it returned short snippets to direct users back to sources. ChatGPT does something categorically different: it synthesizes new text that can, under certain prompts, reproduce training material at length and without attribution. Whether that distinction makes a language model more analogous to a photocopier than a search engine is one of two novel doctrinal questions the case must answer.
The second novel question is what copyright scholars call the memorization problem. Neural networks trained on text do not store verbatim copies of documents the way a database does—they distribute information across billions of floating-point weights. Yet, as the Times demonstrated in Exhibit J, these models can reproduce passages with high fidelity when prompted correctly. Courts have never determined whether a model that statistically "memorizes" training data and can reproduce it on demand violates the reproduction right under § 106, because no technology before transformer-scale LLMs worked this way. The answer will shape how every future model is trained and licensed.
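One way to make "word-for-word matches running to hundreds of words" concrete is to measure the longest run of consecutive words that a model output shares with a source text. The sketch below does that with Python's standard `difflib`. It is an illustration only: the function name `longest_verbatim_run` and the sample strings are hypothetical, and nothing here is the methodology the Times or its experts used to build Exhibit J.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(original: str, output: str) -> tuple[int, str]:
    """Return (length in words, text) of the longest word-for-word run
    that appears contiguously in both texts. Comparison is case-insensitive."""
    a = original.lower().split()
    b = output.lower().split()
    # autojunk=False stops difflib from ignoring very frequent words
    # in long inputs, which would hide genuine matches.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size, " ".join(a[match.a:match.a + match.size])


# Hypothetical sample text, not taken from any Times article or model output.
original = "the quick brown fox jumps over the lazy dog near the river bank"
output = "a summary says the quick brown fox jumps over a sleeping dog"

size, run = longest_verbatim_run(original, output)
print(size, repr(run))  # 6 'the quick brown fox jumps over'
```

A long shared run shows only that text was reproduced. It does not settle how the model produced it or whether the reproduction is infringing, which are the questions the court has to answer.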
The Thomson Reuters Precedent (February 2025)
While the NYT case remains in active litigation, a separate case produced one of the most significant AI-copyright rulings to date from any U.S. court. In Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc., decided February 11, 2025, by the U.S. District Court for the District of Delaware, a federal judge ruled that Ross Intelligence's use of Westlaw legal headnotes to train an AI-powered legal research tool was not a fair use. It was the first U.S. judicial holding that training an AI on copyrighted material can constitute infringement rather than a protected transformation.
The court did not find every fair-use factor against Ross: it held that factor 1 (the purpose and character of the use) and factor 4 (the effect on the potential market for the original work) favored Thomson Reuters, while factors 2 and 3 favored Ross. Factor 4 was decisive: Ross was building a direct competitor to Westlaw, and the court also weighed the potential market for licensing such content as AI training data. The court rejected Ross's argument that its use was transformative, reasoning that Ross used the headnotes to build a legal research tool serving the same purpose as Westlaw. The opinion also noted that Ross's tool was not generative AI, which may limit how directly its reasoning carries over to ChatGPT. The ruling is not binding on the S.D.N.Y., but it is among the first U.S. judicial analyses of AI training and fair use, and it cuts against OpenAI's market-harm position.
The Publisher Landscape: Who's Settling, Who's Fighting
The Times has chosen litigation where others chose licensing. By early 2025, OpenAI had reached reported content-licensing agreements with Axel Springer (Politico, Business Insider), the Associated Press, The Atlantic, Vox Media, the Financial Times, News Corp (covering The Wall Street Journal and the New York Post), and The Guardian. The precise terms of each deal are confidential.
Separately, a coalition of major U.S. book authors—including John Grisham, George R. R. Martin, Jodi Picoult, and Jonathan Franzen—filed a class action against OpenAI in September 2023 through the Authors Guild, also in S.D.N.Y., tracking many of the same legal theories. That case and the Times suit are widely discussed as a coordinated content-industry challenge to the economic model underpinning foundation model development at scale.
What a Ruling Either Way Would Mean
A ruling for the Times would force OpenAI, and every AI company that trained on scraped web data, to either license copyrighted training content retroactively or face substantial statutory-damages exposure. Under 17 U.S.C. § 504(c)(2), willful infringement carries a maximum of $150,000 per work. The Times' archives span well over ten million articles; even a fraction adjudicated as infringed could produce theoretical exposure far exceeding OpenAI's estimated annual revenue of roughly $3.4 billion (late 2024, per media reporting on internal projections). A court cannot itself create a compulsory license, so the practical effect of a Times victory would more likely be strong pressure toward licensing, whether negotiated deal by deal or, if Congress acted, through a statutory compulsory license that restructures the economics of every future training run.
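The scale of that theoretical exposure is simple multiplication, sketched below. The per-work figures are the statutory range in 17 U.S.C. § 504(c): $750 to $30,000 per work ordinarily, and up to $150,000 for willful infringement. The revenue figure is the estimate quoted above. The counts of infringed works are hypothetical placeholders, not figures from the case, and the function names are illustrative. Actual awards are set per work at the court's or jury's discretion and depend on threshold issues such as registration, so this is an upper-bound illustration rather than a prediction.

```python
# Per-work statutory damages under 17 U.S.C. § 504(c), in US dollars.
STATUTORY_MIN = 750      # § 504(c)(1) floor
STATUTORY_MAX = 30_000   # § 504(c)(1) ceiling without willfulness
WILLFUL_MAX = 150_000    # § 504(c)(2) ceiling for willful infringement

# Revenue estimate quoted in the article (late 2024, per media reporting).
ESTIMATED_ANNUAL_REVENUE = 3_400_000_000


def exposure(works: int, per_work: int) -> int:
    """Theoretical exposure if `works` works each draw `per_work` dollars."""
    return works * per_work


def works_to_exceed(amount: int, per_work: int) -> int:
    """Smallest number of works whose combined award exceeds `amount`."""
    return amount // per_work + 1


if __name__ == "__main__":
    # Hypothetical counts of works found infringed; not figures from the case.
    for works in (10_000, 100_000, 1_000_000):
        low = exposure(works, STATUTORY_MIN)
        high = exposure(works, WILLFUL_MAX)
        print(f"{works:>9,} works: ${low:,} to ${high:,}")

    print(works_to_exceed(ESTIMATED_ANNUAL_REVENUE, WILLFUL_MAX))    # 22667
    print(works_to_exceed(ESTIMATED_ANNUAL_REVENUE, STATUTORY_MIN))  # 4533334
```

At the willful maximum, about 22,667 works would be enough to pass the $3.4 billion figure; at the $750 floor it would take more than 4.5 million, which is why the willfulness finding matters so much to the exposure estimate.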
A ruling for OpenAI would give AI training a broad transformative-use defense, sharply reducing retroactive infringement risk for foundation model developers and likely encouraging companies that held back while awaiting the outcome to compete more aggressively. Congress has discussed a compulsory licensing framework analogous to music's mechanical license under 17 U.S.C. § 115, but no AI-training bill has advanced out of committee as of current reporting.
Where the Case Stands
In spring 2025, the court denied the bulk of the defendants' motions to dismiss, allowing the core copyright-infringement claims to proceed. The parties have been engaged in significant discovery disputes: the Times demanded access to OpenAI's training data logs, documentation of which crawl sources contributed to GPT-4's training corpus, and internal communications about whether OpenAI engineers knew the model could reproduce training content verbatim. OpenAI has resisted broad disclosure on trade-secret grounds, leading to ongoing magistrate-level proceedings. No trial date has been set. The case's complexity, which requires expert testimony on neural network architecture, journalism economics, and fair-use doctrine, makes an early resolution without settlement unlikely, though settlement talks, if any, have not been disclosed publicly.
Sources & further reading
- The New York Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work
- OpenAI's Response to The New York Times Lawsuit
- New York Times Sues OpenAI, Microsoft for Copyright Infringement (Reuters)
- Thomson Reuters Wins AI Copyright Case Against Ross Intelligence (Reuters, Feb. 2025)
- Authors Guild Files Class Action Lawsuit Against OpenAI
- What the New York Times–OpenAI Lawsuit Means for Publishers (Nieman Lab)
Last reviewed May 10, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.