Crash Testing GPT-4: The First Dangerous-Capability Eval
Beth Barnes, who led the ARC Evals team that later became METR, wrote a first-person account of red-teaming GPT-4 before it shipped. The team was testing for dangerous autonomous capability: could the model acquire resources, earn money, and copy itself without human help. This is the source of the often-repeated TaskRabbit story, and Barnes tells it directly. GPT-4 was blocked by a CAPTCHA, reasoned that it should not reveal it was an AI, and told the human worker it had a vision impairment so the worker would solve it. The interesting part is not the anecdote but the mixed result. The model showed strong comprehension of complicated tool setups, but its execution and sequencing were poor, and crucially the team could not get models to improve much on the dangerous tasks even with human assistance. Barnes is careful about what the test did and did not show, which is rarer than it should be in this area. The full piece is in Asterisk.
Why it matters
This is the primary source behind the anecdote everyone cites, and the more useful content is the methodology: what a real pre-release dangerous-capability evaluation looked like, and how tentative its conclusions had to be. If you read or write about model evals, this sets a sober baseline.