
When AI Runs Out of Things to Learn From

  • Writer: Nikita Silaech
  • 6 days ago
  • 4 min read
Image on Unsplash

There is a moment in every growth story where you stop asking "how do we scale?" and start asking "where do we get more fuel?". The AI industry is currently at that moment.


Neema Raphael, chief data officer at Goldman Sachs, put it plainly: we have already run out of data. Not running out; already run out (Goldman Sachs, 2025).


What he means is that the publicly available text on the internet, everything written by humans and accessible without permission, has been largely exhausted by companies training their models. When GPT-4 was trained, it consumed 300 billion words. That sounds big until you realize that models trained after that have consumed trillions of data points. There is a finite amount of text on the internet. If companies keep training larger models, they will hit that limit.


Researchers estimate that high-quality public data will be exhausted between 2026 and 2032 (Observer, 2024). Some estimates are more aggressive. An MIT study found that 25 percent of the highest-quality data sources already have restrictions in place preventing AI companies from accessing them. OpenAI is blocked from 26 percent of top-quality sources, Google from 10 percent, and Meta from 4 percent (New York Times, 2024).


So what do companies do when they run out of fuel?


They switch to synthetic data. That is, they train new models on the output of existing models, which is what DeepSeek allegedly did to build its cheaper systems. The problem with this approach is mathematical. Feed a model its own generated content repeatedly, and the distribution it learns converges toward the mean. The edges disappear and the rare cases vanish. Eventually, the model produces nothing but mediocre results, which then become the training data for the next model, which produces even more mediocre outputs.


This is called model collapse, and researchers first documented it in 2024. OpenAI co-founder Ilya Sutskever compared it to fossil fuels. Human-generated text is a finite resource, and we are treating it like it is infinite. Once it is gone, it is gone.
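The convergence-toward-the-mean effect can be illustrated with a toy simulation (a hedged sketch of the idea, not the researchers' actual experiment): fit a Gaussian to samples drawn from the previous generation's fitted Gaussian, repeat, and watch the estimated spread of the data drift toward zero. The function name and parameters below are illustrative, not from any published code.

```python
import random
import statistics

def collapse_demo(n_samples=20, generations=300, seed=0):
    """Toy model collapse: each generation fits a Gaussian to samples
    drawn from the previous generation's fit, then samples from its
    own fit to train the next generation. Returns the fitted standard
    deviation after each generation."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "human data" distribution
    stds = []
    for _ in range(generations):
        # the next model only ever sees the current model's output
        data = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # MLE estimate, slightly biased low
        stds.append(sigma)
    return stds

stds = collapse_demo()
print("gen 1 std: ", round(stds[0], 3))
print("gen 300 std:", round(stds[-1], 3))
```

Each refit nudges the estimated variance down on average, so over many generations the fitted distribution narrows drastically: the tails, which are exactly the rare cases the article describes, are the first thing to go.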


Some companies are betting that they can avoid this trap by moving to proprietary data. Goldman Sachs pointed out that enterprises sit on vast reserves of information that has never been used to train AI: trading flows, client interactions, internal emails, proprietary research. The advantage is that this data is huge and mostly untapped. The disadvantage is that it belongs to someone, and using it carries legal and ethical complications (New York Times, 2024).


Meta has announced that it will use conversations with users across its platforms to train its models. That is technically data it owns, but it is also data that came from billions of people who did not explicitly consent to having their messages used to train AI (LiveScience, 2024). Facebook users did not agree to this when they signed the terms of service.


The harder truth is that we built an AI industry that requires more data than humanity produces. We are now facing three options, and none of them is good.


Option one is to keep training on synthetic data and accept that AI models will gradually degrade. This is the path DeepSeek appears to have taken, and it is working so far, but the long-term consequences are unknown. Models trained on AI-generated text produce narrower outputs, repeat themselves, and lose the ability to handle edge cases. If your AI doctor is trained repeatedly on AI-generated medical notes, it might start recommending treatments that sound plausible but have never been tested on a real human.


Option two is to move to proprietary data at scale. This means companies will increasingly use your data, your emails, your messages, your transactions, your browsing history, to train AI, often without explicit consent. This is technically legal in most jurisdictions, but it raises hard questions about privacy, ownership, and what it means to consent to data use. Meta has already started this. Others might follow.


Option three is to invest heavily in human-generated data. That means hiring people to write things, paying them to generate diverse, high-quality content specifically for training AI. This is expensive and slow and counterintuitive in an industry that has been obsessed with automation (Goldman Sachs, 2025). But it might be the only way to keep models from degrading.


The thing that makes this different from other resource constraints is that it is not a shortage people will admit to openly. When you run out of oil, everyone knows it. When you run out of training data, you can hide it by switching to a different training method and hoping nobody notices the degradation. You can announce new models and new capabilities while quietly training on lower-quality data. Users might not realize the difference until they notice their AI is slightly worse at rare cases or edge scenarios.


Companies are releasing new models faster than ever, but the real frontier for AI development has shifted from the public internet to the proprietary databases of large corporations. Whoever controls that data controls the next generation of AI.
