Testing AI in the open world, not just on benchmarks

AI · 13 hours ago · source (normaltech.ai)

A team of 17 researchers led by Sayash Kapoor and Arvind Narayanan has launched CRUX, short for Collaborative Research for Updating AI eXpectations, to test frontier models the way they actually get used rather than on fixed benchmarks. Their argument is that standard tests are running out of room. SWE-Bench, ARC-AGI, and METR's time-horizon suite have all been saturated or rebuilt, and many evaluation platforms double as reinforcement-learning training targets, which makes high scores harder to trust.

The headline experiment is concrete. Using Claude Opus 4.6 inside an agent scaffold, the team had a model build a small breathing-exercise app and take it all the way through Apple's App Store review. The agent wrote the code, drafted a privacy policy, and filled out the compliance forms. Development took about 45 minutes and roughly 25 dollars in model calls, though monitoring the ten-day review pushed the total near 1,000 dollars. The agent made two mistakes, needed one manual step beyond those Apple requires, and at one point rewrote its own approach to cut running cost from 35 dollars an hour to 3. The app is now live. The researchers notified Apple's security team a month in advance, warning that spammers could soon submit apps at this scale.

The write-up, published on the AI as Normal Technology site, also reviews ten other open-world runs and sets out practical rules: state what you are measuring, log everything, watch agents live, and report cost alongside capability.
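
Those rules amount to a simple logging discipline. Here is a minimal sketch in Python of what such a harness might look like; the names (RunLog, log_step, report) and the example figures are hypothetical illustrations, not CRUX's actual tooling.

```python
# Hypothetical sketch of the evaluation discipline described above,
# not CRUX's real harness: state the claim, log every step, stream
# for live monitoring, and report cost alongside capability.
import json
import time
from dataclasses import dataclass, field

@dataclass
class RunLog:
    claim: str                          # state up front what the run measures
    steps: list = field(default_factory=list)
    cost_usd: float = 0.0

    def log_step(self, action: str, cost_usd: float = 0.0) -> None:
        # Log everything: the action, a timestamp, and its cost.
        entry = {"t": time.time(), "action": action, "cost_usd": cost_usd}
        self.steps.append(entry)
        self.cost_usd += cost_usd
        print(json.dumps(entry))        # stream so a human can watch live

    def report(self, succeeded: bool) -> dict:
        # Report cost next to capability, never one without the other.
        return {
            "claim": self.claim,
            "succeeded": succeeded,
            "steps": len(self.steps),
            "total_cost_usd": round(self.cost_usd, 2),
        }

# Example figures loosely mirror the App Store run described above.
run = RunLog(claim="Agent ships a working iOS app through App Store review")
run.log_step("wrote app code and compliance forms", cost_usd=25.0)
run.log_step("monitored ten-day review", cost_usd=975.0)
print(run.report(succeeded=True))
```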

Why it matters

If you make decisions from model benchmark scores, this is a reminder that saturated benchmarks can hide what agents already do end to end, so test on your own messy workflow before trusting either the hype or the dismissal.

Evaluation · Agents