NYT v. OpenAI: The Copyright Suit Putting AI Training's Fair-Use Defense on Trial

AI News Published May 10, 2026 · copyright law · fair use · openai · microsoft · llm training

When The New York Times filed its copyright complaint in federal court on December 27, 2023, the paper's lawyers opened with a claim that cut to the heart of the modern AI industry: that OpenAI and Microsoft had copied millions of Times articles—painstakingly reported, edited, and fact-checked over 172 years of publishing—to train the GPT family of large language models, and that ChatGPT could reproduce those articles nearly word-for-word on demand, threatening the subscription revenue the Times depends on to fund its journalism.

The case, formally captioned The New York Times Company v. Microsoft Corporation et al., S.D.N.Y. No. 1:23-cv-11195, is the most consequential copyright lawsuit of the AI era—not only because of the plaintiff, but because the doctrinal question it poses has no established answer: is feeding billions of copyrighted words into a neural network a "transformative use" protected by the fair-use doctrine, or an act of infringement at industrial scale?

The Case at a Glance

The 69-page complaint, filed December 27, 2023, names OpenAI LP, OpenAI Inc., several subsidiary entities, and Microsoft Corporation as defendants. The core allegation is two-fold. First, that OpenAI unlawfully copied Times articles into training datasets for its GPT-2, GPT-3, GPT-3.5, and GPT-4 models—acts of reproduction on OpenAI servers before a user ran a single query. Second, that when ChatGPT and Microsoft Copilot generate responses, they sometimes reproduce Times content verbatim or near-verbatim in ways that substitute for visiting the original, paywalled article.

The complaint's most striking exhibit—Exhibit J—showed dozens of side-by-side comparisons: a ChatGPT output column matched against the original Times article, with word-for-word matches running to hundreds of words. One comparison reproduced a Pulitzer Prize–winning investigative piece on restaurant-industry abuse to a degree the Times argued made the AI output a functional substitute for the paywalled original. These examples became the factual anchor of the Times' market-harm argument.

The Two Theories of Infringement

Copyright lawyers watching the case have identified two legally distinct theories that the Times is pressing simultaneously. Courts have resolved neither at LLM scale.

The ingestion theory. Every time OpenAI copied an article into a training corpus, it created an unauthorized reproduction of a copyrighted work under 17 U.S.C. § 106(1). That the copy was later "consumed" during training does not eliminate the infringement. On this theory, OpenAI's fair-use defense must succeed for the training process itself, not just for model outputs.

The output theory. Even if training qualifies as fair use, when ChatGPT reproduces a Times article verbatim in response to a user prompt, that output is itself an infringement: a reproduction and public display of the copyrighted work without a license, delivered to a user who might otherwise have paid for access.

OpenAI contests both. On the ingestion theory, it argues that training is transformative under Authors Guild v. Google, 804 F.3d 202 (2d Cir. 2015), which held that Google's full-text scanning of books for its search index was fair use because it added informational value without displaying full text to end users. On the output theory, OpenAI contends that verbatim reproduction is a rare "bug" the company actively suppresses, not an intended product feature, and that the Times engineered Exhibit J through adversarial prompt manipulation designed to surface edge-case failures.

The New Doctrine Being Tested

Authors Guild v. Google is the spine of OpenAI's defense, but its application to large language models is legally untested. The Second Circuit emphasized in 2015 that Google's index never displayed full text to users—it returned short snippets to direct users back to sources. ChatGPT does something categorically different: it synthesizes new text that can, under certain prompts, reproduce training material at length and without attribution. Whether that distinction makes a language model more analogous to a photocopier than a search engine is one of two novel doctrinal questions the case must answer.

The second novel question is what copyright scholars call the memorization problem. Neural networks trained on text do not store verbatim copies of documents the way a database does—they distribute information across billions of floating-point weights. Yet, as the Times demonstrated in Exhibit J, these models can reproduce passages with high fidelity when prompted correctly. Courts have never determined whether a model that statistically "memorizes" training data and can reproduce it on demand violates the reproduction right under § 106, because no technology before transformer-scale LLMs worked this way. The answer will shape how every future model is trained and licensed.
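To make "reproduce passages with high fidelity" concrete, here is a minimal sketch of one way to measure verbatim overlap between a source text and a model output: find the longest run of consecutive words the two share. This is an illustration only, not the method the Times or its experts used. The function name `longest_verbatim_run` and both sample strings are hypothetical; the sample text is invented and is neither Times content nor real ChatGPT output.

```python
from difflib import SequenceMatcher


def longest_verbatim_run(source: str, output: str) -> list[str]:
    """Return the longest run of consecutive words shared by both texts.

    Comparison is case-insensitive and splits on whitespace only, so
    punctuation attached to a word counts as part of that word.
    """
    src_words = source.lower().split()
    out_words = output.lower().split()
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent items, which would otherwise skew long-text comparisons.
    matcher = SequenceMatcher(None, src_words, out_words, autojunk=False)
    match = matcher.find_longest_match(0, len(src_words), 0, len(out_words))
    return src_words[match.a : match.a + match.size]


# Invented placeholder texts, for illustration only.
article = (
    "the city council voted late on tuesday to approve "
    "the new transit budget after months of debate"
)
model_output = (
    "according to reports the city council voted late on tuesday "
    "to approve the plan"
)

run = longest_verbatim_run(article, model_output)
print(len(run), "words:", " ".join(run))
```

On these placeholder strings the longest shared run is ten words. A longest-run measure like this captures only exact word-for-word copying; it says nothing about paraphrase, and it does not answer the legal question of whether the model's weights themselves contain a "copy."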

The Thomson Reuters Precedent (February 2025)

While the NYT case remains in active litigation, a related case produced one of the most significant AI-copyright rulings so far from a U.S. court. In Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc., decided February 11, 2025, by the U.S. District Court for the District of Delaware, a federal judge ruled on summary judgment that Ross Intelligence's use of Westlaw legal headnotes to train an AI-powered legal research tool was not fair use. It was the first U.S. judicial holding that training an AI on copyrighted material can constitute infringement rather than a protected transformation.

The court did not find every fair-use factor against Ross: factors 2 and 3 (the nature of the work and the amount used) favored Ross, while factors 1 and 4 favored Thomson Reuters and carried more weight. Factor 4, the effect on the potential market for the original work, was decisive: Ross was building a direct competitor to Westlaw, and the court also pointed to a potential market for licensing such content as AI training data. On Factor 1, the court rejected Ross's argument that training a machine to perform legal research is transformative compared to using legal text for legal research. The ruling is not binding on the S.D.N.Y., and it concerned a non-generative research tool, but it is an early judicial analysis of AI training and fair use, and it cuts sharply against OpenAI's market-harm position.

The Publisher Landscape: Who's Settling, Who's Fighting

The Times has chosen litigation where others chose licensing. By early 2025, OpenAI had reached reported content-licensing agreements with Axel Springer (Politico, Business Insider), the Associated Press, The Atlantic, Vox Media, The Financial Times, News Corp (covering The Wall Street Journal and The New York Post), and The Guardian. The precise terms of each deal are confidential.

Estimate, not confirmed: Industry analysts estimated individual deal values in the range of $1–$10 million per publisher per year, depending on archive size and audience scale. No party has disclosed exact figures; these estimates are inferred from deal announcement language and media trade reporting as of mid-2024 and should not be treated as sourced figures.

Separately, a coalition of major U.S. book authors—including John Grisham, George R. R. Martin, Jodi Picoult, and Jonathan Franzen—filed a class action against OpenAI in September 2023 through the Authors Guild, also in S.D.N.Y., tracking many of the same legal theories. That case and the Times suit are widely discussed as a coordinated content-industry challenge to the economic model underpinning foundation model development at scale.

What a Ruling Either Way Would Mean

A ruling for the Times would force OpenAI—and every AI company that trained on scraped web data—to either license copyrighted training content retroactively or face substantial statutory-damages exposure. Under 17 U.S.C. § 504(c)(2), willful infringement carries a maximum of $150,000 per work. The Times' archives span well over ten million articles; even a fraction adjudicated as infringed could produce theoretical exposure far exceeding OpenAI's estimated annual revenue of roughly $3.4 billion (late 2024, per media reporting on internal projections). The practical effect of a Times victory would likely be a mandatory licensing regime—potentially a statutory compulsory license—that restructures the economics of every future training run.

A ruling for OpenAI would establish a broad transformative-use safe harbor for AI training, freeing every foundation model developer from retroactive infringement risk and likely accelerating a fresh competitive round from companies that held back while awaiting the outcome. Congress has discussed a compulsory licensing framework analogous to music's mechanical license under 17 U.S.C. § 115, but no AI-training bill has advanced out of committee as of current reporting.

Conjecture, marked clearly: The "10 million articles × $150,000" framing describes the arithmetic ceiling under § 504(c)(2) for willful infringement, not any realistic damages scenario. Courts routinely award a fraction of the statutory maximum; infringement must be proven article-by-article with proper registration; and cases of this magnitude almost invariably resolve through settlement, where the actual figure is determined by negotiation rather than adjudication. OpenAI's ~$3.4 billion revenue figure comes from media reporting on internal projections as of late 2024 and has not been independently audited.
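The ceiling arithmetic can be worked through in a few lines. The sketch below uses only the article's own rough figures (the $150,000 willful cap, an archive rounded to ten million articles, and the reported ~$3.4 billion revenue estimate). The constant and function names are hypothetical, and the output is an upper bound on paper, not a damages forecast.

```python
# Arithmetic ceiling on statutory damages, using only the article's figures.
# ARCHIVE_WORKS and REPORTED_REVENUE are rough, unaudited numbers taken from
# the text above; nothing here predicts an actual award.

WILLFUL_MAX_PER_WORK = 150_000      # 17 U.S.C. § 504(c)(2) ceiling, in dollars
ARCHIVE_WORKS = 10_000_000          # "well over ten million articles", rounded
REPORTED_REVENUE = 3_400_000_000    # ~$3.4 billion, per media reporting


def ceiling_exposure(works: int, per_work: int = WILLFUL_MAX_PER_WORK) -> int:
    """Theoretical maximum: every work infringed, willful, at the full cap."""
    return works * per_work


def works_to_reach(target: int, per_work: int = WILLFUL_MAX_PER_WORK) -> int:
    """Smallest number of works whose capped damages meet or exceed target."""
    return -(-target // per_work)  # integer ceiling division


full_ceiling = ceiling_exposure(ARCHIVE_WORKS)
break_even_works = works_to_reach(REPORTED_REVENUE)
share_of_archive = break_even_works / ARCHIVE_WORKS

print(f"Ceiling for the whole archive: ${full_ceiling:,}")
print(f"Works needed to match reported revenue: {break_even_works:,}")
print(f"Share of archive: {share_of_archive:.3%}")
```

On these inputs the whole-archive ceiling is $1.5 trillion, and about 22,667 works at the full cap (roughly 0.23% of a ten-million-article archive) would equal the reported revenue figure. That is why "even a fraction" produces a large theoretical number, and also why the caveats above about registration, willfulness, and judicial discretion matter so much.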

Where the Case Stands

OpenAI filed a partial motion to dismiss in early 2024; the court largely denied it in 2025, allowing the core copyright-infringement claims to proceed. The parties have been engaged in significant discovery disputes: the Times demanded access to OpenAI's training data logs, documentation of which crawl sources contributed to GPT-4's training corpus, and internal communications about whether OpenAI engineers knew the model could reproduce training content verbatim. OpenAI has resisted broad disclosure on trade-secret grounds, leading to ongoing magistrate-level proceedings. No trial date has been set. The case's complexity, which requires expert testimony on neural network architecture, journalism economics, and fair-use doctrine, makes an early resolution without settlement unlikely, though settlement talks, if any, have not been disclosed publicly.

Frequently asked

What is OpenAI's primary legal defense in the NYT case?

OpenAI argues that training an AI on copyrighted text is a "transformative use" protected by the fair-use doctrine under 17 U.S.C. § 107. The company relies heavily on Authors Guild v. Google (2d Cir. 2015), which held that full-text scanning of books for a search index was fair use because it added new informational value. OpenAI also contends that the near-verbatim reproduction examples the Times presented in Exhibit J were engineered through adversarial prompting—edge-case failures—and that the company actively works to reduce model memorization of training data.

Can the Times actually collect $150,000 per infringed article?

The $150,000 statutory ceiling under 17 U.S.C. § 504(c)(2) applies only when infringement is proven willful, and the Times must also show each article was registered with the U.S. Copyright Office before the infringement occurred—a prerequisite for statutory damages. Courts have wide discretion to award far less than the cap. More practically, cases of this magnitude almost always settle; actual recovery, if any, would be determined through negotiation rather than a jury verdict. The headline figure functions more as settlement leverage than a realistic award prediction.

What did the Thomson Reuters v. Ross Intelligence ruling actually decide?

In February 2025, a Delaware federal court held on summary judgment that Ross Intelligence's use of Westlaw legal headnotes to train an AI legal research tool was copyright infringement, not fair use. Two of the four fair-use factors favored Ross, but the two the court weighted most heavily went against it, with market substitution (Ross was building a direct Westlaw competitor) being decisive. It is the first U.S. court ruling to hold that AI training on copyrighted material is not automatically protected by fair use, and it provides a persuasive (though non-binding in New York) reference point for the NYT case.

Why have major publishers signed licensing deals with OpenAI rather than sue?

Publishers weigh the certainty of licensing revenue against the cost, duration, and outcome-uncertainty of multi-year litigation. A deal provides immediate, predictable income and preserves the working relationship with a major AI distributor. Some publishers also value having their content included in AI training because it can drive citations and audience traffic. Smaller publishers generally lack the legal resources the Times has dedicated to this case, making litigation a worse option for them economically.

Does Section 230 of the Communications Decency Act protect OpenAI from these claims?

Almost certainly not in this context. Section 230 shields platforms from liability for content created by third-party users, and it expressly carves out intellectual property claims: under 47 U.S.C. § 230(e)(2), nothing in the section limits or expands any law pertaining to intellectual property. The Times' claim is also that OpenAI itself, not any user, copied Times content for training and generated the infringing outputs. Courts have consistently held that Section 230 does not protect a platform's own content creation, and the prevailing view among copyright scholars is that Section 230 provides no defense against the Times' copyright-infringement theories.

Last reviewed May 10, 2026. AI Pulled News is editorial; corrections welcome at /news/about.html.