A new benchmark sends frontier agents to fix Kubernetes, and none pass
Artificial Analysis and IBM Research published ITBench-AA, a benchmark that asks frontier models to act as on-call SREs and root-cause Kubernetes incidents from logs, alerts, traces, metrics, and topology. The task is concrete: read what a real cluster looks like during a failure, submit a structured JSON diagnosis, and earn credit only when you name every true cause, with less credit for each extra cause you invent. The dataset has 59 incidents (40 public, 19 held out), each replayed in an open-source Stirrup harness with sandboxed shell access and a 100-turn cap. Scoring is average precision at full recall, so missing a single ground-truth cause drops the score for that task to zero.
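The scoring rule is easier to see in code. Below is a minimal sketch of precision at full recall, assuming each diagnosis and each ground truth reduces to a set of named causes. The function names and the example cause names are illustrative assumptions, not taken from the ITBench-AA harness.

```python
def precision_at_full_recall(predicted, truth):
    """Score one task: 0 unless every true cause is named, else precision."""
    predicted, truth = set(predicted), set(truth)
    if not truth:
        raise ValueError("each task needs at least one ground-truth cause")
    # Missing any true cause means recall < 1, so the task scores zero.
    if not truth <= predicted:
        return 0.0
    # Full recall reached: every extra (false-positive) cause lowers precision.
    return len(truth) / len(predicted)


def benchmark_score(tasks):
    """Average the per-task scores over (predicted, truth) pairs."""
    scores = [precision_at_full_recall(p, t) for p, t in tasks]
    return sum(scores) / len(scores)


# Hypothetical cause names, for illustration only.
truth = ["checkout-db", "payment-service"]
exact = precision_at_full_recall(["checkout-db", "payment-service"], truth)
missed_one = precision_at_full_recall(["checkout-db"], truth)
over_reported = precision_at_full_recall(
    ["checkout-db", "payment-service", "chaos-mesh-controller", "ingress"], truth
)
```

Under these assumptions, an exact diagnosis scores 1, a diagnosis that misses one cause scores 0, and one that names both causes plus two extras scores one half.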
The headline number is that nobody crosses 50%. Claude Opus 4.7 leads at 47% (about $5.38 per task at Max Effort), GPT-5.5 follows at 46%, and Qwen3.7 Max sits at 42%. In the next tier, which includes the open-weights entries, GLM-5.1 and Gemini 3.5 Flash both land at 40%, DeepSeek V4 Pro at 38%, and Gemma 4 31B at 37% for $0.14 per task, which makes it the cost-performance pick. A more interesting finding hides in the turn counts. Gemini 3.1 Pro Preview averages 83 turns and scores 30%; Gemma 4 31B uses fewer turns and scores higher. Over-investigating tends to surface false positives, such as chaos-mesh controllers or upstream mechanisms, which then drag down the precision score.
The benchmark is designed to stay unsaturated, with held-out tasks and a strict scoring rule. The leaderboard, the harness, and the dataset are all published, so teams can run their own models under the same protocol rather than trusting vendor claims.
Why it matters
If your team is sizing up AI for incident response, this is a benchmark that will not flatter the models you are considering. Read the cost column alongside the score column: a $0.14-per-task model that beats a $2-per-task model on the same SRE task is the actual story, not the absolute number at the top of the leaderboard.
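One way to read the two columns together is score points per dollar of per-task cost. The sketch below uses only the two models whose cost is quoted above. The metric itself is an illustrative assumption, not something the leaderboard reports.

```python
# (score in percent, cost in USD per task), as quoted in this article.
results = {
    "Claude Opus 4.7": (47, 5.38),
    "Gemma 4 31B": (37, 0.14),
}


def points_per_dollar(score_pct, cost_per_task):
    """Score points earned per dollar spent on a single task."""
    return score_pct / cost_per_task


ppd = {name: points_per_dollar(*row) for name, row in results.items()}

# Rank models from most to least score per dollar.
for name, value in sorted(ppd.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.1f} points per dollar")
```

A ratio like this rewards cheap models heavily, so it works best as a tiebreaker among models that already clear whatever accuracy floor your team needs.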