Why AI's Measured Progress Isn't Real Progress
- Nikita Silaech
- Nov 14, 2025
- 3 min read

We're living through an extraordinarily fast-paced era of AI development. Every month brings a new model announcement with benchmark scores climbing higher. Training costs to reach frontier capability now exceed $100 million, with forecasts predicting the largest models will cost over a billion dollars by 2027 (Epoch AI, 2024).
Meanwhile, enterprise ROI from AI initiatives sits at just 5.9%, with only 10% of organizations reporting significant returns from AI deployments (Deloitte, 2025).
This gap between rising capability scores and lagging real-world returns often goes unnoticed. The problem isn't that models aren't improving; it's that we're probably measuring the wrong things, and optimizing toward those measurements is creating a kind of progress trap.
For instance, look at what happened with SWE-bench, a benchmark designed to measure a model's ability to solve real software engineering problems drawn from GitHub issues. Performance jumped from 4.4% to 71.7% in a single year (Stanford HAI, 2025).
But this benchmark doesn't reflect what happens when you put these models into actual engineering workflows. The benchmark problems are simplified, de-contextualized versions of real work. They strip out the messy context that real engineering problems come wrapped in: unclear requirements, legacy code dependencies, and undocumented systems. A model can ace the test environment without acing the actual task.
When the measurement becomes the goal, models can easily learn to exploit the statistical quirks of specific benchmarks. They are, in effect, taught to the test: an efficient way to win leaderboards, but a poor guarantee of reliability on the tasks that actually matter.
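One reason teaching to the test works so well is that benchmark items, or near-duplicates of them, often leak into training corpora. A crude heuristic for spotting that kind of contamination is n-gram overlap between benchmark items and a sample of the training data. The sketch below is a minimal illustration, not a rigorous audit; the file names and the 8-gram window are assumptions.

```python
# Crude n-gram contamination check: flag benchmark items whose text shares
# any 8-gram with a sample of the training corpus. File paths are hypothetical.

def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_texts, bench_items, n=8):
    train_grams = set()
    for t in train_texts:
        train_grams |= ngrams(t, n)
    flagged = sum(1 for item in bench_items if ngrams(item, n) & train_grams)
    return flagged / max(len(bench_items), 1)

if __name__ == "__main__":
    # "train_sample.txt" and "benchmark_items.txt" are assumed local files.
    with open("train_sample.txt") as f:
        train_texts = f.read().split("\n\n")
    with open("benchmark_items.txt") as f:
        bench_items = [line.strip() for line in f if line.strip()]
    rate = contamination_rate(train_texts, bench_items)
    print(f"benchmark items sharing an 8-gram with training data: {rate:.1%}")
```

Real contamination audits work at corpus scale and normalize text far more carefully; the toy version only makes the idea concrete.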
Recent research has found that many benchmarks considered "saturated," including GSM8K (which frontier models reportedly solve at roughly 95% accuracy), contain label errors and ambiguous items (Vendrow et al., 2025).
When the researchers revised those benchmarks, model performance dropped sharply, which suggests the frontier models hadn't actually solved the tasks; they had learned to navigate the datasets' flaws.
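A direct way to see this fragility is to re-evaluate a model on lightly perturbed variants of benchmark problems, for example GSM8K-style word problems with the numbers changed, and compare accuracy against the originals. The toy sketch below uses a deliberately contrived `memorizing_model` as a stand-in for a system that has only learned the original items' surface patterns; it is an illustration of the evaluation trick, not a claim about how any particular frontier model works.

```python
import random

# Toy illustration of why perturbing benchmark problems exposes memorization.
# `memorizing_model` is a contrived stand-in for a system that has only
# learned the original dataset's surface patterns.

TEMPLATE = "A baker sells {a} trays of {b} muffins each. How many muffins in total?"

def make_problem(a: int, b: int) -> tuple[str, int]:
    # Return (question text, gold answer).
    return TEMPLATE.format(a=a, b=b), a * b

ORIGINALS = [make_problem(a, b) for a, b in [(3, 12), (7, 8), (5, 24)]]
MEMORIZED = dict(ORIGINALS)  # exactly what the "model" has learned

def memorizing_model(question: str) -> int:
    # Perfect recall on items it has seen before, a fixed guess otherwise.
    return MEMORIZED.get(question, 42)

def accuracy(problems: list[tuple[str, int]]) -> float:
    return sum(memorizing_model(q) == gold for q, gold in problems) / len(problems)

if __name__ == "__main__":
    rng = random.Random(0)
    perturbed = [make_problem(rng.randint(2, 99), rng.randint(2, 99)) for _ in ORIGINALS]
    print("original accuracy: ", accuracy(ORIGINALS))   # 1.0
    print("perturbed accuracy:", accuracy(perturbed))   # almost certainly 0.0
```

A real harness would call an actual model API in place of `memorizing_model`, but the shape of the result is the point: perfect scores on the original items, and a collapse on near-identical variants.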
The performance gap between the top model and the tenth-best on standard leaderboards has shrunk from 11.9% two years ago to 5.4% now (Stanford HAI, 2025). This looks like competitive saturation, but it might reflect something different: models are all learning the same statistical patterns and converging on the same benchmark exploits. Differentiation is becoming hard not because every model is exceptional, but because they're all optimizing toward the same measures.
Organizations are increasing AI spending out of both strategy and necessity; fear of falling behind matters to them as much as actual capability does. At the same time, leaders are narrowing their focus to high-confidence use cases, emphasizing speed and scalability over innovation (Deloitte, 2025).
Generative AI is already delivering measurable productivity improvements in some areas, but the ROI timelines are still long and uncertain. Among organizations using agentic AI, half expect returns within three years, and another third expect five years or more (Deloitte, 2025).
The cost structure compounds the problem. Training a frontier model now costs tens to hundreds of millions of dollars, and for most applications there is still no reliable way to tell whether the resulting capability is worth that price.
What the data shows is a widening gap between the industry's definition of progress and the market's need for value. As long as benchmark scores are the primary measure of success, the incentive will be to build models that are good at passing tests, not necessarily at solving real-world problems. The result is an arms race to the top of a leaderboard that has less and less to do with the practical work of actually deploying AI.