Tag: Evaluation

AI agents now finish 16 percent of real freelance jobs, up from 2.5 percent (safe.ai)

AI · 20 hours ago · July 15, 2026
Why price per million tokens tells you almost nothing (janilowski.pl)

AI · 1 week ago · July 7, 2026
GeneBench-Pro shows AI still stumbles on real biology analysis (openai.com)

AI · 1 week ago · July 5, 2026
Using DSPy to find a hidden bug in an agent's prompt (simonwillison.net)

AI · 1 week ago · July 3, 2026
A new benchmark shows agents rebuilding software that would take humans weeks (epoch.ai)

AI · 2 weeks ago · June 30, 2026
An open model beat Claude Code at finding access-control bugs (semgrep.dev)

Security · 2 weeks ago · June 29, 2026
An API change that helps big models can break small ones (huggingface.co)

AI · 3 weeks ago · June 21, 2026
GLM-5.2 takes the lead among open-weights models (artificialanalysis.ai)

AI · 3 weeks ago · June 18, 2026
OpenAI predicts model behavior by replaying real conversations (openai.com)

AI · 3 weeks ago · June 18, 2026
Ai2's olmo-eval brings statistical rigor to the model training loop (huggingface.co)

AI · 1 month ago · June 13, 2026
Did Claude make rsync buggier? The numbers say no (alexispurslane.github.io)

AI · 1 month ago · June 5, 2026
Microsoft ASSERT turns plain-English behavior specs into AI tests (techcrunch.com)

Engineering · 1 month ago · June 4, 2026
Princeton researchers pick apart Google's claim that AI agents built an OS for $916 (normaltech.ai)

AI · 1 month ago · May 30, 2026
A new benchmark sends frontier agents to fix Kubernetes, and none pass (huggingface.co)

AI · 1 month ago · May 27, 2026
An open leaderboard for whole agent systems, not just models (huggingface.co)

AI · 1 month ago · May 19, 2026
Ai2 launches a shared benchmark for AI climate models (allenai.org)

AI · 1 month ago · May 18, 2026
The Open ASR Leaderboard adds private tests to stop gaming (huggingface.co)

AI · 1 month ago · May 18, 2026
Testing AI in the open world, not just on benchmarks (normaltech.ai)

AI · 1 month ago · May 17, 2026
How you pick benchmarks decides whether open models are far behind (interconnects.ai)

AI · 2 months ago · May 16, 2026
One benchmark number hides which jobs a model is actually good at (interconnects.ai)

AI · 2 months ago · April 20, 2026
DeepMind proposes a cognitive framework for measuring AGI progress (blog.google)

AI · 4 months ago · March 17, 2026
Crash Testing GPT-4: The First Dangerous-Capability Eval (asteriskmag.com)

AI · 3 years ago · June 1, 2023