How you pick benchmarks decides whether open models are far behind

AI · 1 day ago · source (interconnects.ai)

The Center for AI Standards and Innovation published an evaluation concluding that open models lag the American frontier and that the gap is widening over time. Florian Brand and Nathan Lambert read the underlying numbers and found that the headline depends heavily on three tests where DeepSeek-V4 scored poorly: CTF-Archive-Diamond, PortBench, and ARC-AGI-2. Lean on those and the gap looks large. Use Epoch AI's ECI metric instead and open models have held roughly three to seven months behind since R1, which is a very different story.

The post also walks through a wave of recent open releases, including Gemma 4 in several sizes, DeepSeek-V4 in Flash and Pro variants, Kimi-K2.6, MiMo-V2.5-Pro, and GLM-5.1. The pattern they draw out is that coding benchmarks understate real capability when run without the development harness a model was trained for, so methodology choices, not just model quality, move the conclusion.

The value here is the audit. Rather than repeat the official takeaway, they show which specific benchmarks carry it and what changes when you swap the yardstick.

Why it matters

If a benchmark result is about to drive a procurement or policy decision, check which tests produced it before you act: the same models can look either far behind or nearly caught up depending on the metric, and that choice is now doing real work in government assessments.

Open Models · Evaluation