
When Data Ownership Becomes a Game of Hide and Seek

  • Writer: Nikita Silaech
  • Nov 18, 2025
  • 3 min read
Image on Unsplash

Courts have started drawing a line that the AI industry did not see coming. Training a model on copyrighted material is fair use, but obtaining that material through pirated copies is not (Copyright Alliance, 2025). The distinction sounds small. In practice, it carries enormous weight.


The shift appeared first in Judge William Alsup's ruling in Anthropic's copyright case earlier this year. Alsup found that generative AI training itself qualifies as transformative fair use, which looked like a victory for AI labs. But he then carved out an important exception: if the copyrighted works were obtained through pirate sites like LibGen, the entire analysis changes (McKool Smith, 2025).


Anthropic may have been within fair use when it trained on the books, but it had no legal right to possess pirated copies of them in the first place.


If AI companies cannot use pirated copies as their source material, even when the ultimate training use might qualify as fair use, then the way they have been assembling datasets becomes a serious legal liability (McKool Smith, 2025).


Statutory damages for willful infringement reach $150,000 per work, and some datasets contain hundreds of thousands of copyrighted texts. The potential liability for companies that scraped LibGen, the Tor network, or similar pirate sources is now measured in the billions (Copyright Alliance, 2025).
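To make the scale concrete, a rough and purely illustrative calculation: if a dataset contained 200,000 pirated books (a hypothetical figure, in line with the "hundreds of thousands" above) and each drew the $150,000 statutory maximum, the theoretical exposure would be 200,000 × $150,000 = $30 billion. Real awards vary per work, but even far more conservative per-work figures keep the total in the billions.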


Data sourcing has, until now, been treated as a side issue. Researchers published papers on model architecture, benchmark performance, and scaling laws while remaining deliberately vague about where the training data actually came from (USC Law, 2025). The industry leaned on a kind of strategic opacity: as long as the legal question was framed as "is training fair use?", the process of acquisition stayed in the background (McKool Smith, 2025). Now that process itself is under scrutiny.


The problem is that legitimate data is expensive and fragmented. A book publisher that licenses content to an AI company typically charges tens of thousands of dollars. Academic publishers price access differently depending on the use case. Journalists and authors, as individuals, have no standardized licensing mechanism at all (Copyright Alliance, 2025).


By contrast, a pirated dataset is free, comprehensive, and ready to use. The economic incentive to use pirated data has been overwhelming, especially when the legal assumption was that training counted as fair use regardless (USC Law, 2025).


Anthropic's settlement this August (the terms remain confidential) sent the message that defending those acquisition methods in front of a jury is riskier than negotiating a fee (McKool Smith, 2025). The company had a defensible argument on training methodology, but it lost the argument on sourcing (Copyright Alliance, 2025).


Other AI labs are probably watching carefully. OpenAI faces similar claims from The New York Times and multiple authors. Meta's LLaMA model was trained on LibGen books scraped without consent (Futurism, 2025). The pattern seems to be industry-wide.


What makes this moment unusual is that the law is not blocking AI training itself. Fair use seems to survive intact, at least in the U.S. courts that have ruled so far. What the law is blocking, or at least making enormously expensive, is the shortcut of training on free, pirated data (McKool Smith, 2025). If you want to build an AI model, you can. But you have to buy your training data from the people who own it.


That constraint changes the cost structure of AI development, fragments the datasets available to different companies based on the licensing deals they can afford, and potentially shifts competitive advantage away from whoever is cheapest toward whoever has the best licensing relationships. It also means that smaller labs and open-source projects may struggle to assemble training data at all if they cannot negotiate publisher agreements (USC Law, 2025).


The settlement also reveals something about how courts think about responsibility here. Alsup essentially said that the problem is not that you trained an AI model; the problem is that you treated copyrighted works as if they were yours to acquire however you wanted (McKool Smith, 2025). That framing suggests future liability may extend beyond training, to anyone in the pipeline who knowingly helped assemble pirated datasets.


For the industry, the path forward is licensing data, building its own datasets, or negotiating with rights holders individually (USC Law, 2025). The days of treating data sourcing as an invisible step in the model-building process are over. That does not stop AI development. It just changes who can afford it and how much it costs.


