One benchmark number hides which jobs a model is actually good at
Nathan Lambert's argument in this essay is that collapsing the open-versus-closed question into one benchmark number throws away the only information that matters: which specific capabilities a model is strong at. Benchmarks correlate weakly with how a model behaves once deployed, and the field's evaluation focus shifts every twelve to eighteen months, so any single score is a measurement of a moving target.
His sharpest example is Gemini 3, which posted excellent benchmark scores yet, he writes, showed "remarkable irrelevance" in agent deployment, the place where these tools are actually being put to work. He also traces how the center of gravity has moved: from chat and math right after ChatGPT, to coding and agentic tasks, and now toward specialized knowledge work in accounting, law, and healthcare. The catch is that the data needed to improve in those newer domains is increasingly private, unlike the public GitHub code that drove coding gains, which makes honest evaluation a research problem in itself.
The piece is a useful corrective to leaderboard reflexes without pretending benchmarks are worthless.
Why it matters
If you are comparing models for real work, build your own evaluation on the tasks you actually run: public scores are drifting away from the private, specialized work where models now compete, and a top-line number can point you at the wrong model. A minimal sketch of what such an evaluation can look like follows.
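As a concrete illustration (mine, not anything from the essay), here is a minimal task-specific eval harness in Python. The file name my_real_tasks.jsonl, the substring-match grader, and the stub model functions are all hypothetical placeholders; the point is the shape: a fixed set of tasks drawn from your own work, a deterministic grading rule, and a comparable per-model score.

```python
import json

# Hypothetical harness: the file name, grading rule, and model stubs
# below are illustrative placeholders, not from Lambert's essay.

def load_tasks(path: str) -> list[dict]:
    """Load one JSON object per line: {"prompt": ..., "expected": ...}."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def grade(output: str, expected: str) -> bool:
    """Simplest possible check; replace with whatever 'correct' means for your job."""
    return expected.strip().lower() in output.strip().lower()

def evaluate(model_fn, tasks: list[dict]) -> float:
    """Return the fraction of your tasks the model gets right."""
    passed = sum(grade(model_fn(t["prompt"]), t["expected"]) for t in tasks)
    return passed / len(tasks)

if __name__ == "__main__":
    tasks = load_tasks("my_real_tasks.jsonl")  # your private, job-specific cases

    # Stub models; in practice each would wrap an API call or a local model.
    models = {
        "model_a": lambda prompt: "stub answer A",
        "model_b": lambda prompt: "stub answer B",
    }
    for name, model_fn in models.items():
        print(f"{name}: {evaluate(model_fn, tasks):.0%} on my tasks")
```

Even a small set of cases pulled from your actual workload will often say more about which model to pick than any leaderboard delta, precisely because it measures the private, specialized work the public scores miss.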