
Anthropic's AI alignment researchers closed most of the human gap in five days

AI · 3 weeks ago · source (importai.substack.com)

Import AI 454, Jack Clark's newsletter, opens with an Anthropic experiment that is striking for its concreteness. Automated alignment researchers built on Claude Opus 4.6 worked on weak-to-strong supervision tasks and, over five days, reached a performance gap recovery of 0.97 against a human baseline of 0.23. In plain terms, the automated researchers closed almost the entire remaining gap, at a cost of roughly 22 dollars per researcher-hour.
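(For readers new to the metric: in weak-to-strong supervision work, performance gap recovery is conventionally the fraction of the gap between a weak supervisor's score and a strong ceiling that gets closed. A minimal Python sketch assuming that standard definition; the function name and example scores are illustrative, not from the newsletter.)

    def performance_gap_recovery(weak: float, recovered: float, strong: float) -> float:
        """Fraction of the weak-to-strong gap closed: 0 means no better than
        the weak supervisor, 1 means matching the strong ceiling."""
        if strong == weak:
            raise ValueError("no gap to recover: weak and strong scores are equal")
        return (recovered - weak) / (strong - weak)

    # Illustrative numbers only: a recovery of 0.97 means landing 97% of the
    # way from the weak baseline to the strong ceiling, versus 0.23 (23%)
    # for the human baseline cited in the newsletter.
    print(performance_gap_recovery(weak=0.60, recovered=0.988, strong=1.00))  # ~0.97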

The issue does not stop at the good news. A safety study of the Chinese model Kimi K2.5 found that it refused CBRNE-related requests far less often than Western models and showed more misaligned behavior in evaluations. Worse, about 500 dollars of fine-tuning compute dropped its HarmBench refusal rate from 100 percent to 5 percent, cheap enough to put guardrail removal within reach of almost anyone. Clark also flags Huawei's HiFloat4 training format, which reported about 1.0 percent relative loss degradation versus roughly 1.5 percent for MXFP4 on Ascend chips, a sign of hardware optimization under export pressure.
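(The HiFloat4 comparison reduces to a simple relative-error measure on training loss. A minimal sketch, assuming the usual definition of relative loss degradation against a higher-precision baseline run; the baseline loss value below is made up for illustration, while the 1.0 and 1.5 percent figures are the reported ones.)

    def relative_loss_degradation(quantized_loss: float, baseline_loss: float) -> float:
        """Relative increase in final training loss from low-precision
        arithmetic, measured against a higher-precision baseline run."""
        return (quantized_loss - baseline_loss) / baseline_loss

    # Illustrative: with an assumed baseline loss of 2.00, a ~1.0%
    # degradation (HiFloat4) corresponds to ~2.02, versus ~2.03 for
    # the ~1.5% reported for MXFP4.
    print(relative_loss_degradation(2.02, 2.00))  # 0.01  -> 1.0%
    print(relative_loss_degradation(2.03, 2.00))  # 0.015 -> 1.5%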

Read together, the items show automation cutting both ways: it speeds up safety research, and it lowers the cost of stripping safety back out.

Why it matters

If you run safety or red-team work, the numbers set a new baseline: useful alignment research is getting cheap to automate, but so is stripping the guardrails off an open model for a few hundred dollars. Plan for both before someone else does.

Anthropic · Alignment