If you’ve ever grabbed a dataset from HuggingFace, glanced at the license field, and moved on with your life, you’re not alone. That’s how most ML engineers handle data licensing. Check the tag, assume it’s correct, start training.
Turns out that assumption is wrong more often than it’s right.
The Audit That Should Worry You
In 2024, the Data Provenance Initiative published the results of a massive audit in Nature Machine Intelligence. A multi-disciplinary team of legal scholars and ML researchers traced the lineage of over 1,800 text datasets commonly used for AI training. They checked where the data actually came from, what licenses applied to the source material, and whether the metadata on hosting platforms matched reality.
The findings were bad.
Over 70% of datasets on popular hosting platforms had missing license information. Over 50% had incorrect license metadata. That’s not a rounding error. That’s the majority of the ecosystem operating on wrong or nonexistent license data.
IEEE Spectrum covered it with the headline “Public AI Training Datasets Are Rife With Licensing Errors.” That about sums it up.
But the headline number actually understates the problem. A follow-up study from the same team, “Bridging the Data Provenance Gap Across Text, Speech and Video”, audited nearly 4,000 datasets across modalities. They found that while less than 33% of datasets are labeled as restrictively licensed, over 80% of the actual source content in widely-used text, speech, and video datasets carries non-commercial restrictions.
The labels say “commercial use OK.” The underlying data says otherwise.
This happens because dataset creators scrape content from multiple sources, slap a permissive license on the resulting collection, and don’t trace whether the individual sources actually allow that. The license on the dataset page reflects what the creator chose, not what the source material permits.
Most ML engineers don’t have the time (or frankly, the legal training) to audit this chain themselves. So they trust the label. And the labels are wrong.
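To make the failure mode concrete, here's a minimal sketch of a label-vs-source audit: compare a dataset's declared license tag against the licenses on its actual source material and flag the conflicts. The license identifiers and the compatibility set are illustrative assumptions on my part, not legal advice.

```python
# Illustrative subset of source licenses a permissive dataset label
# (apache-2.0, mit, cc0-1.0) can sit on top of without contradiction.
COMPATIBLE_WITH_PERMISSIVE = {
    "cc0-1.0", "public-domain", "mit", "apache-2.0", "cc-by-4.0",
}

def audit_dataset_label(declared: str, source_licenses: set[str]) -> list[str]:
    """Return the source licenses that conflict with the declared label."""
    if declared not in {"apache-2.0", "mit", "cc0-1.0"}:
        return []  # this sketch only audits permissive labels
    return sorted(source_licenses - COMPATIBLE_WITH_PERMISSIVE)

# A dataset tagged apache-2.0 but built partly from Stack Overflow
# (CC-BY-SA) content:
conflicts = audit_dataset_label("apache-2.0", {"cc-by-sa-4.0", "cc-by-4.0"})
print(conflicts)  # ['cc-by-sa-4.0'] — the share-alike source contradicts the label
```

Trivial as it looks, this check is exactly the step the audit found missing: nobody compares the label on the collection to the licenses on what went into it.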
The License Chain Problem
License contamination doesn’t stay contained. It propagates through the ML pipeline in ways that are genuinely hard to track.
Here’s a concrete chain. Follow along and see where the license information gets lost.
Step 1: Someone creates Dataset A using CC-BY-SA content scraped from Stack Overflow. The dataset is uploaded to HuggingFace with an “apache-2.0” tag. The CC-BY-SA share-alike requirement from the source material is already invisible.
Step 2: A team trains Model B on Dataset A. They check the license tag. It says Apache 2.0. They proceed.
Step 3: Another team uses Model B to generate synthetic training data, creating Dataset C. They label it with their own license.
Step 4: Model D trains on Dataset C. By this point, nobody remembers (or even knows) that the original training signal traces back to CC-BY-SA content with a share-alike clause.
This isn’t hypothetical. This is how the open-source ML ecosystem actually works.
Consider the WizardLM family of models. WizardLM and WizardCoder used “Evol-Instruct,” a technique that generates training data by prompting ChatGPT to rewrite and complexify existing instructions. The resulting models were released for “academic research purposes only,” partly because the training data was generated through OpenAI’s API, and OpenAI’s terms of service restrict using outputs to develop competing models. Whether those contractual restrictions actually bind downstream users is a legal question that hasn’t been definitively answered.
The broader point: by the time a model is three or four steps removed from its original training data, the license provenance is effectively lost. Nobody can tell you with confidence what restrictions apply. And a 2024 study evaluating LLMs on license compliance in code generation found that some models score as low as 0.153 on a License Compliance (LiCo) metric, meaning they’re nearly incapable of providing correct license information about the code they generate.
The license chain was never really assembled in the first place.
The EU AI Act: From Sloppy to Illegal
For years, the training data license situation was a known-but-ignored problem. Sloppy, sure. But nobody was enforcing anything.
That changed on August 2, 2025.
The EU AI Act’s general-purpose AI model obligations took effect on that date. If you’re building or deploying GPAI models in the EU, here’s what’s now required:
Training data transparency. Providers must publish a public summary of the content used to train their models, using the European Commission’s mandatory template. Not a vague description. A structured disclosure covering data types, sources, and collection methods.
Copyright compliance. From 2026, developers must check copyright reservations on data sources before training. If a rights holder has opted out, you can’t use their content.
Penalties. The worst violations (prohibited AI practices) carry fines of up to 35 million euros or 7% of global annual turnover, whichever is higher. Other infractions cap at 15 million or 3%. For context, 7% of Google’s 2024 revenue would be roughly $24 billion. These aren’t symbolic fines.
Grandfathering. Models placed on the market before August 2025 have until August 2027 to comply. That’s not a grace period for new development. It’s a countdown for existing models.
Major AI companies are already signing the GPAI Code of Practice, a voluntary compliance framework that functions as a bridge to the mandatory requirements. Meta notably declined to sign, arguing the rules stifle innovation. But declining the voluntary code doesn’t exempt you from the mandatory obligations. It just means you don’t get the benefit of the doubt when enforcement starts.
The European Commission can begin enforcement actions from August 2026. That’s six months from now.
If your training data pipeline can’t produce a compliant summary of what data you used, where it came from, and what licenses apply, you have a problem that’s measured in euros with a lot of zeros.
The Courts Are Deciding
While Europe builds regulatory infrastructure, US courts are deciding the legal questions the hard way: case by case.
Thomson Reuters v. Ross Intelligence (February 2025). The first US court ruling on fair use in AI training, and fair use lost. Ross used Thomson Reuters’ Westlaw headnotes to train a competing legal research tool. The court found this wasn’t transformative because the output served the same market as the input. Key takeaway: domain-specific training on commercial content, used to build a competing product, is high-risk territory.
Bartz v. Anthropic (June 2025). Fair use won, but with a catch. Judge Alsup found that training LLMs on copyrighted books was “spectacularly” transformative because the model doesn’t reproduce the books, it extracts statistical patterns. But the same ruling found that using pirated copies of those books was not fair use, even if the eventual training was transformative. You can train on copyrighted material. You can’t steal it first.
NYT v. OpenAI (ongoing). The big one. The New York Times sued OpenAI for training on millions of copyrighted articles. The case survived OpenAI’s motion to dismiss and is proceeding toward trial, with summary judgment briefing expected to wrap by April 2026. In January 2026, the court compelled OpenAI to produce 20 million ChatGPT conversation logs as evidence. This case will likely set the most significant precedent for whether large-scale training on copyrighted content qualifies as fair use.
Fastcase v. Alexi Technologies (November 2025). A concrete example of data license enforcement. Fastcase licensed its legal database to Alexi for “internal research purposes.” Alexi then used that data to train a commercial AI product that competed directly with Fastcase. When confronted, Alexi’s lawyers admitted they’d used the data as training data and intended to keep doing so. The court denied Alexi’s emergency motion to restore access. This is what happens when license terms actually get enforced.
The pattern is clear. Courts aren’t issuing blanket rulings for or against AI training. They’re looking at specifics: Was the use transformative? Did it compete with the original? Were the copies obtained lawfully? Was there a license, and was it followed?
“We didn’t check” is not a defense that’s aging well.
What Compliance Actually Requires
So what does a legally defensible training data pipeline look like? It needs four things that most teams currently lack.
Provenance tracking. Where did each sample come from? Not “HuggingFace” or “the internet.” The actual source URL, the original publisher, the date it was collected. If you can’t trace a training sample back to its origin, you can’t determine what restrictions apply to it.
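As a sketch, a per-sample provenance record might look like the following. The field names are my own assumptions, not a standard schema, and the example values are illustrative.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class ProvenanceRecord:
    sample_id: str
    source_url: str      # the actual page, not "HuggingFace" or "the internet"
    publisher: str       # original publisher of the content
    collected_on: date   # when it was scraped or downloaded
    source_license: str  # license on the source material itself

rec = ProvenanceRecord(
    sample_id="sample-00042",
    source_url="https://www.bls.gov/cps/tables.htm",
    publisher="U.S. Bureau of Labor Statistics",
    collected_on=date(2025, 11, 3),
    source_license="public-domain (US federal work)",
)
```

The point isn't the data structure. It's that this record has to exist for every sample, at collection time, because it can't be reconstructed afterward.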
License verification. What license does the source material carry? Not the dataset label on HuggingFace. The actual license on the content that was scraped, downloaded, or transcribed to create the dataset. As the Data Provenance Initiative showed, these two things rarely match.
Chain-of-custody documentation. Can you trace from a model’s output back through its training data to the original source? If your model is three generations removed from the original data (a model trained on synthetic data, which was generated by a model trained on a dataset scraped from the web), can you reconstruct that chain? The EU AI Act’s training data summary requirement is essentially asking for this.
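Mechanically, chain-of-custody just means keeping explicit parent links so the generations can be walked back on demand. A minimal sketch, with hypothetical artifact names:

```python
# Each artifact records what it was derived from (illustrative names).
LINEAGE = {
    "model-d":             "synthetic-dataset-c",
    "synthetic-dataset-c": "model-b",
    "model-b":             "dataset-a",
    "dataset-a":           "https://stackoverflow.com (CC-BY-SA)",
}

def trace(artifact: str) -> list[str]:
    """Walk parent links from an artifact back to its original source."""
    chain = [artifact]
    while chain[-1] in LINEAGE:
        chain.append(LINEAGE[chain[-1]])
    return chain

print(trace("model-d"))
# ['model-d', 'synthetic-dataset-c', 'model-b', 'dataset-a',
#  'https://stackoverflow.com (CC-BY-SA)']
```

A lookup table like this is trivial to maintain going forward and nearly impossible to rebuild retroactively, which is the whole problem.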
Copyright reservation compliance. Under the EU’s text and data mining exceptions, rights holders can opt out of having their content used for AI training. From 2026, you need a process for checking and respecting those opt-outs. That means monitoring robots.txt, checking for machine-readable copyright reservations, and actually honoring them.
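For the robots.txt piece, Python's standard library already does the parsing. A sketch, where the crawler name "ExampleTrainingBot" and the rules are illustrative; note that real TDM reservations can also appear in HTTP headers or page metadata, which this doesn't cover.

```python
from urllib.robotparser import RobotFileParser

# A publisher that allows general crawling but opts its content
# out of collection by AI training bots:
robots_txt = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("ExampleTrainingBot", "https://example.com/article"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/article"))        # True
```

The hard part isn't the check. It's running it for every source, at collection time, and keeping the result attached to the sample as evidence.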
Most teams today have none of this infrastructure. License compliance is a checkbox on a dataset card, filled in by whoever uploaded the data, never verified. Provenance is “I downloaded it from this URL.” Chain of custody doesn’t exist.
Building this from scratch is genuinely hard. It’s a data engineering problem, a legal problem, and an operational problem all at once. There’s no off-the-shelf solution that handles all of it.
But “it’s hard” is not going to be a compelling argument when a regulator asks for your training data summary.
The Clean Data Strategy
There’s a simpler approach to this problem, at least for a significant portion of training data needs.
Government data published by US federal agencies is generally public domain under federal law. Works produced by the US government aren’t eligible for copyright protection domestically (17 U.S.C. § 105). When your training data comes from federal sources, you skip the entire license audit. There’s no license chain to trace because there’s no license to violate.
That’s how federal data has always worked.
A similar clarity applies to many international government sources. These aren’t public domain, but the terms are explicit: the World Bank publishes data under CC-BY 4.0 with clear attribution requirements, and many national statistical offices publish under similar open terms. The licensing is unambiguous and documented at the source.
OpenData focuses on exactly these government sources: BLS, Census, FRED, EPA, Treasury, World Bank, and others. Each dataset tracks its source URL back to the original federal agency. That’s not full “provenance tracking” in the sense of chain-of-custody verification across a multi-step training pipeline. But it’s a clean foundation where the licensing question has a clear, documented answer.
For ML teams worried about training data compliance, government data offers something rare in this ecosystem: certainty. You know where it came from. You know the license status. You can point a regulator at the source and say, “This is public domain federal data, here’s the agency, here’s the URL.”
That won’t cover every training data need. You’re not going to build a creative writing model exclusively on Census data. But for quantitative models, economic analysis, environmental research, demographic studies, and a long list of other applications, government data is both high-quality and unambiguously licensed.
The easiest way to solve the training data license problem is to not have one. Start with data that’s clean by default.
What Happens Next
The next twelve months will reshape how the AI industry thinks about training data. The EU AI Act enforcement begins in August 2026. The NYT v. OpenAI case approaches summary judgment. More courts will weigh in on fair use. The GPAI Code of Practice will start distinguishing compliant companies from non-compliant ones.
If you’re building models today, the time to get your data provenance in order is before enforcement starts, not after. Audit what you’re training on. Trace where it came from. Verify the licenses. And where possible, build on data sources where the licensing is unambiguous from the start.
The “grab data, check the tag, hope for the best” era is winding down.
OpenData is an open-source platform for accessing public government datasets via a simple API. Browse available datasets at opendata.place or check out the source on GitHub.