Vol. I · No. 1
Est. MMXXVI

The A.I. Beat

Dispatches from the frontier of machine intelligence
May 3, 2026 · 7 min read
Regulation

AI and Copyright: The Billion-Dollar Question of Who Owns What

The NYT is suing OpenAI. Getty won a default judgment against Stability. Congress is drafting new statutes. Here's the full legal, technical, and economic picture of AI copyright in 2026.

The most consequential legal question in the AI industry is not about safety, alignment, or existential risk. It is about money. Specifically: did AI companies commit the largest act of copyright infringement in history when they scraped the internet to build training datasets, or did they perform a transformative act of learning that falls squarely within fair use?

Billions of dollars in litigation, licensing deals, and future business models hinge on how courts answer that question. And as of May 2026, they have not finished answering it.

The Fundamental Tension

The economic logic of modern AI training creates a structural conflict between creators and model developers that no amount of goodwill can fully resolve.

The cycle, in which models trained on creators’ work generate output that competes with that same work, is not hypothetical. Stock photography revenue at Getty Images fell 12% year-over-year in 2025, a decline the company attributed in part to AI image generators trained on its library. Freelance writing rates on major content platforms dropped 20-30% between 2023 and 2025, according to survey data from the Freelancers Union, as clients shifted to AI-generated first drafts.

The AI companies’ counterargument — that models learn patterns and concepts rather than copying specific works, analogous to a human studying examples — has some legal support in the concept of transformative use. But it has not yet survived a full trial on the merits.

The Major Cases

The litigation landscape is sprawling. Here are the cases that will shape the law.

The NYT case is the bellwether

The New York Times v. OpenAI is the most closely watched case because it has the strongest fact pattern for plaintiffs. The Times demonstrated that ChatGPT could reproduce near-verbatim passages from its articles — not paraphrases, not summaries, but text matching word-for-word for paragraphs at a time. OpenAI has argued this is a cherry-picked edge case that can be mitigated with guardrails, not evidence of systematic copying.

The case will likely turn on two questions: whether training itself constitutes copying (the reproduction right under 17 U.S.C. § 106(1)), and whether the model’s outputs are “substantially similar” to the copyrighted works in its training data in a legally meaningful sense.

The Getty default judgment

Getty’s UK case against Stability AI produced a remarkable outcome: Stability AI failed to file a defense in the High Court, resulting in a default judgment in Getty’s favor in February 2025. The practical impact was limited, because Stability AI argued the UK entity was non-operational, but the precedent is significant. A UK court has now affirmed, at least procedurally, that training an image model on a commercially licensed image library without permission can constitute infringement under UK law.

What’s in the Training Data?

The composition of major AI training datasets has become public through litigation discovery, research papers, and investigative journalism. The picture is roughly as follows:

The “web crawl” category is where most of the copyright tension lives. Common Crawl, the most widely used web corpus, contains billions of pages scraped from across the internet, including copyrighted news articles, blog posts, creative writing, and product descriptions. When researchers at the Washington Post analyzed the C4 dataset (a cleaned version of Common Crawl used to train Google’s T5 and many subsequent models), they found nytimes.com, latimes.com, theguardian.com, forbes.com, and washingtonpost.com itself among the highest-ranked domains.

The Books3 dataset, 196,640 books scraped from the Bibliotik pirate library, was used to train Meta’s LLaMA, Bloomberg’s BloombergGPT, and likely others. It was assembled and released by researcher Shawn Presser and became a centerpiece of the Authors Guild litigation.

The U.S. Copyright Office has issued three substantive guidance documents on AI:

  1. Registration guidance (February 2023, updated March 2023): AI-generated content is not copyrightable because it lacks human authorship. However, a human who “selects or arranges AI-generated material in a sufficiently creative way” may copyright that selection and arrangement. The decision to cancel the copyright registration of the AI-generated comic book Zarya of the Dawn (retaining protection only for the human-authored text and arrangement) remains the leading precedent.

  2. Notice of inquiry on training (August 2023): The office received over 10,000 public comments on whether training on copyrighted works should require permission. It has not issued a final determination.

  3. Report to Congress (July 2025): Recommended that Congress create a statutory licensing framework for AI training data, modeled loosely on the compulsory licensing regime for musical compositions under 17 U.S.C. § 115. No legislation has been enacted as of May 2026, though the Schumer-Rounds AI Copyright Act (S. 2847) is in committee.

The Opt-Out Landscape

In the absence of clear law, a patchwork of technical and contractual opt-out mechanisms has emerged:

robots.txt: The Robots Exclusion Protocol, a 30-year-old standard for web crawlers, has been repurposed for AI. Major publishers now include directives like User-agent: GPTBot / Disallow: / in their robots.txt files. A 2025 study by Originality.ai found that over 35% of the top 1,000 websites now block at least one AI crawler. The legal enforceability of robots.txt is untested — it is a voluntary standard, not a legal instrument.
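To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler would interpret such a directive, using Python's standard-library robots.txt parser. The file contents and the second user agent are illustrative assumptions, not taken from any real publisher. Nothing in the protocol forces a crawler to run this check, which is exactly why enforceability remains an open question.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt (an assumption, not a real publisher's file):
# block the GPTBot crawler everywhere, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt: str, agents: list[str], path: str = "/") -> dict[str, bool]:
    """Map each user agent to True if robots.txt disallows it from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}


result = blocked_agents(ROBOTS_TXT, ["GPTBot", "Googlebot"])
print(result)  # {'GPTBot': True, 'Googlebot': False}
```

The check is purely advisory: the crawler asks the file for permission, and the file has no way to refuse a crawler that never asks.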

ai.txt: The Spawning.ai “ai.txt” proposal (an AI-specific companion to robots.txt) has seen limited adoption. It allows site operators to specify whether their content can be used for training, for inference, or not at all.

Data licensing deals: The market for authorized training data has exploded. OpenAI has signed licensing agreements with the Associated Press, Axel Springer, Le Monde, Prisa Media, Dotdash Meredith, and others — reportedly paying between $1 million and $10 million annually per publisher, depending on archive size and exclusivity. Google has signed similar deals for Gemini training. Anthropic and Meta have been quieter about licensing but are understood to have agreements in place.

The economic math is stark: the total cost of licensing a training corpus at publisher-negotiated rates would run into the tens of billions of dollars, which is why AI companies have fought hard for the fair use defense rather than conceding that licensing is required.
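A back-of-envelope check shows what that figure implies. The sketch below uses only the per-publisher range quoted above and makes two loud assumptions: that "tens of billions" means at least $10 billion, and that the whole bill is paid at a flat per-publisher rate in a single year. It is a rough scale estimate, not a model of how any actual deal is priced.

```python
# Figures quoted in the text (USD per publisher per year).
PER_PUBLISHER_LOW = 1_000_000
PER_PUBLISHER_HIGH = 10_000_000

# Assumption: "tens of billions" is read as a floor of $10 billion.
CORPUS_COST_FLOOR = 10_000_000_000


def implied_deal_count(total_cost: int, rate_per_publisher: int) -> int:
    """Number of publisher-sized deals needed to reach total_cost at a flat rate."""
    return total_cost // rate_per_publisher


fewest = implied_deal_count(CORPUS_COST_FLOOR, PER_PUBLISHER_HIGH)
most = implied_deal_count(CORPUS_COST_FLOOR, PER_PUBLISHER_LOW)
print(f"{fewest:,} to {most:,} publisher-scale deals")  # 1,000 to 10,000 publisher-scale deals
```

Under those assumptions, a fully licensed corpus would mean on the order of a thousand to ten thousand deals of the size reported so far, against the handful that have been announced.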

The Doe v. GitHub case raises questions specific to software. When GitHub Copilot generates code, it sometimes produces output that matches open-source code from its training data. This creates a novel problem: if the training code was licensed under the GPL (which requires distributed derivative works to carry the GPL as well), does Copilot-generated code carry that license obligation?

The implications are significant:

  • GPL-licensed training data: If Copilot outputs are derivative works of GPL code, any proprietary software incorporating that output could be in violation of the GPL. This is the “copyleft infection” scenario that corporate legal departments have long feared.
  • MIT/Apache-licensed code: These permissive licenses require attribution but allow proprietary use. If Copilot generates code originally written under MIT license without including the copyright notice, that is arguably a license violation.
  • Unlicensed code: Code without an explicit license is copyrighted by default. Training on it may be infringement.

In practice, most companies using Copilot and similar tools have adopted a pragmatic approach: use AI-generated code for boilerplate and non-critical paths, conduct license scanning on the output (tools like FOSSA and Snyk now offer AI-generated code scanning), and avoid using AI-generated code for core proprietary IP.
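As a rough illustration of what the license-scanning step looks for, the sketch below flags generated code that carries recognizable license text. The marker patterns and function name are hypothetical and deliberately naive. This is not how FOSSA or Snyk work internally, and it only catches copied headers, not verbatim code that arrives without its notice.

```python
import re

# Hypothetical marker patterns for illustration; real scanners compare
# code against large indexed corpora rather than grepping for headers.
LICENSE_MARKERS = {
    "GPL": re.compile(
        r"GNU General Public License|SPDX-License-Identifier:\s*GPL", re.I
    ),
    "MIT": re.compile(
        r"SPDX-License-Identifier:\s*MIT|Permission is hereby granted, free of charge",
        re.I,
    ),
    "Apache-2.0": re.compile(
        r"SPDX-License-Identifier:\s*Apache-2\.0|Licensed under the Apache License",
        re.I,
    ),
}


def find_license_markers(source: str) -> list[str]:
    """Return the sorted names of licenses whose marker text appears in `source`."""
    return sorted(
        name for name, pattern in LICENSE_MARKERS.items() if pattern.search(source)
    )


snippet = '''\
# SPDX-License-Identifier: GPL-3.0-or-later
def add(a, b):
    return a + b
'''

flags = find_license_markers(snippet)
print(flags)  # ['GPL']
```

An empty result means only that no marker text was found. It is not evidence that the code is free of license obligations.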

Emerging Compromises

The industry is gravitating toward several mechanisms that attempt to bridge the creator-model developer divide:

Revenue sharing: YouTube’s approach to AI music — allowing AI-generated songs that reference existing artists’ styles while sharing revenue with rights holders — is a model being watched closely. Spotify has experimented with similar frameworks for AI-generated audio content.

Content credentials: The Coalition for Content Provenance and Authenticity (C2PA) standard embeds cryptographic metadata in files to establish provenance. Adobe, Microsoft, the BBC, and others have adopted it. While not a copyright solution per se, it creates an evidentiary trail that supports attribution and enforcement.
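The underlying idea, binding a signed claim to a hash of the content, can be sketched in a few lines. This is not the C2PA manifest format or the API of any C2PA library. As far as the specification goes, C2PA relies on certificate-based public-key signatures, whereas this sketch substitutes a shared-secret HMAC to stay within the standard library. The function names, field names, and key are all illustrative assumptions.

```python
import hashlib
import hmac
import json


def make_manifest(content: bytes, creator: str, key: bytes) -> dict:
    """Build a provenance claim over the content hash and sign it (HMAC stand-in)."""
    claim = {"creator": creator, "sha256": hashlib.sha256(content).hexdigest()}
    payload = json.dumps(claim, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"claim": claim, "signature": signature}


def verify_manifest(content: bytes, manifest: dict, key: bytes) -> bool:
    """Check that the claim is authentic and still matches the content bytes."""
    payload = json.dumps(manifest["claim"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, manifest["signature"])
    hash_ok = manifest["claim"]["sha256"] == hashlib.sha256(content).hexdigest()
    return signature_ok and hash_ok


key = b"demo-signing-key"          # hypothetical key for the example
image = b"original image bytes"    # stand-in for real file contents
manifest = make_manifest(image, "Example Photographer", key)

print(verify_manifest(image, manifest, key))            # True
print(verify_manifest(image + b"edit", manifest, key))  # False
```

Any change to the content or to the claim breaks verification, which is what makes such metadata useful as evidence of origin even though it does nothing to prevent copying.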

Collective licensing bodies: The Copyright Clearance Center (CCC) in the US and the Authors’ Licensing and Collecting Society (ALCS) in the UK are developing blanket licensing frameworks for AI training, similar to how ASCAP and BMI license music for public performance.

Synthetic data: Some labs are shifting toward training on synthetic data — content generated by AI models themselves — to reduce dependence on copyrighted works. The legal footing of this approach is untested, and there are technical concerns about “model collapse” (degraded performance when models train on their own outputs).

What to Expect

The NYT v. OpenAI case will likely produce a ruling in late 2026 or 2027 that establishes the first major judicial precedent on AI training and fair use. Regardless of the outcome, expect an appeal and eventual Supreme Court review — this is too consequential for the lower courts to settle definitively.

In the meantime, the practical advice remains conservative:

For creators: Register your copyrights (in the US, statutory damages are generally available only for works registered in time). Implement technical opt-outs. Track where your work appears in AI outputs. Consider licensing proactively: the rates are better now than they will be after a court ruling, regardless of which way it goes.

For companies using AI-generated content: Do not assume it is copyrightable. Do not use it for core IP. Maintain records of what was human-authored versus AI-generated. The major AI providers (OpenAI, Google, Microsoft, Amazon, Anthropic) now offer copyright indemnification clauses in their enterprise agreements — use them.

For developers: Treat AI-generated code the way you treat code from Stack Overflow: useful, but verify the license before you ship it. For anything that touches core product IP, write it yourself.

The law will catch up with the technology. It always does. The question is how much damage — to creators, to AI companies, to the public interest in both innovation and creative expression — accumulates in the gap.
