← Writing

The model treadmill

GPT-5.4 dropped four days ago with native computer-use. Before that, Gemini 3.1 Pro on February 19. Before that, Claude Opus 4.6 on February 5, the “agent teams” release, 1M context, 14-hour task horizon, the one where the demos showed an agent managing a cluster of sub-agents like a small startup. Before that, Grok 4.20 with its multi-agent setup. Before that, a dozen other releases I’ve already half-forgotten.

It’s March 2026 and we’re getting a new frontier model roughly every two weeks. That’s the treadmill. I wrote last summer that benchmarks are theater. I was right, and it’s gotten worse.

the benchmark problem, evolved

The piece I wrote in August was about the gap between leaderboard numbers and actual usefulness. That gap still exists. But now there’s a second problem layered on top of it: the benchmarks themselves are gone.

MMLU was saturated. Then the hard reasoning evals got saturated. Then the coding benchmarks. Then the ones invented specifically to be hard got saturated within a cycle or two. The SWE-bench Pro data from last September showed top models at roughly 23%, which sounds low until you realize the benchmark was designed to be hard, the models had been trained with it in mind, and 23% was still being called a breakthrough. Every new model crushes every prior benchmark, which at some point stops being impressive and starts being definitional: a benchmark that gets crushed by the next release is just a temporary threshold, not a measure of anything durable.

You could ask: name a benchmark that didn’t get crushed by the next model drop. The honest answer is none of them. Not one has held for two consecutive major releases. That’s not a sign the models are improving (or not only that). It’s a sign the benchmarks stopped being independent measurements and became part of the marketing apparatus.

So the number goes up, the discourse runs its 48-hour cycle, and then everyone goes back to their editors.

the actual problem for builders

Here’s what the treadmill costs you: time, certainty, and the temptation to rebuild.

Every release comes with real capability claims. Some of them are true. Opus 4.6’s long context is genuinely useful for certain tasks; the agent-teams architecture isn’t just theater. GPT-5.4’s native computer-use is a real capability change, not just a score. The problem is you can’t tell in advance which claims will hold up for your work, and “rebuild your integration to try the new thing” has a fixed tax regardless of the outcome.

If you rebuilt your agent stack on every major release over the past eight months, you’ve paid that tax five or six times. Most of those rebuilds probably didn’t meaningfully change your outputs. Maybe one did. You won’t know which one until after you’ve done all of them.

The cumulative cost is months of refactoring chasing a moving target. That’s not a strategy, it’s exhaustion with extra steps.

the only posture that works

I maintain a lean agent, OpenSAM, on provider-agnostic plumbing. OpenRouter handles model routing. Swapping the underlying model is a config line, not a rewrite. That architecture didn’t happen because I was clever; it happened because I got tired of paying the refactor tax.

The posture is simple: ignore the leaderboard. Watch for capability changes that are specific and real. Not “better at benchmarks” but “now does X you actually need.” When one of those lands, spend an afternoon testing it against your actual workload. If the output is meaningfully better, flip the config line. If it isn’t, wait.

The models genuinely are getting better. Not at the rate the announcement cadence implies, and not in the ways the benchmark sheets highlight, but they are. The long-context improvements matter if you’re doing multi-file edits or long agent runs. The reasoning improvements show up in places where the old model would confidently go sideways. Computer-use might matter if your workflow touches a browser. These are real, specific things. They’re findable in twenty minutes of honest testing.

What doesn’t matter is whether model X scored higher than model Y on an eval you’ve never run, for a task you don’t do, on a leaderboard that will be out of date by the time you finish reading the announcement post.

staying still while the treadmill runs

The exhausting part isn’t the models. It’s the discourse around them. Every two weeks there’s a new “the game has changed” take, a new set of demo videos, a new claim about the benchmark that finally means something. Most of it is noise. All of it wants your attention.

The discipline I’ve landed on is the same one I wrote about last September in a different context: don’t react to the feed, react to the artifact. Run the model on the thing you’re actually stuck on. If the output is better, use it. If it isn’t, close the tab.

The leaderboard will still be there when you get back. It’ll just have different numbers.