Jack Clark sorts attacks on AI agents into six kinds

AI · April 13, 2026 · source (importai.substack.com)

Import AI 453, Jack Clark's newsletter, leads with a taxonomy that is useful because it is organized by what an attack targets rather than how it is delivered. The six kinds are content injection, which targets perception; semantic manipulation, which targets reasoning; cognitive-state attacks, which target memory and learning; behavioural control, which targets the action the agent takes; systemic attacks, which target multi-agent dynamics; and human-in-the-loop attacks, which target the person supervising. Laid out this way, the point is hard to miss: defending the model alone leaves most of the surface untouched.

The same issue also reports a capability marker. In the MirrorCode evaluation, Claude Opus 4.6 reimplemented gotree, a bioinformatics toolkit of roughly 16,000 lines of Go, a task that researchers estimate would take a human two to seventeen weeks. Clark also notes that Ryan Greenblatt doubled his estimate, from 15 to 30 percent, of the probability of fully automated AI research and development by the end of 2028. The issue's section on disempowerment is more speculative and reads as a survey of arguments rather than a claim.

Why it matters

If you deploy agents, the taxonomy is a practical checklist: map your defenses against all six targets, because injection filters do nothing for attacks aimed at memory, coordination, or the human in the loop.
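One way to turn the checklist into something you can run is a small coverage map. The sketch below is illustrative only: the six targets come from the taxonomy as summarized above, while the defense names (`prompt-injection filter`, `tool-call allowlist`) and the helper `uncovered_targets` are hypothetical and stand in for whatever controls a real deployment has.

```python
from enum import Enum


class Target(Enum):
    """What an attack aims at; values are the attack kinds from the taxonomy."""
    PERCEPTION = "content injection"
    REASONING = "semantic manipulation"
    MEMORY_AND_LEARNING = "cognitive-state attacks"
    ACTION = "behavioural control"
    MULTI_AGENT_DYNAMICS = "systemic attacks"
    HUMAN_SUPERVISOR = "human-in-the-loop attacks"


def uncovered_targets(defenses: dict[str, set[Target]]) -> list[Target]:
    """Return the targets that no listed defense claims to cover."""
    covered = set().union(*defenses.values())
    return [target for target in Target if target not in covered]


# Hypothetical inventory: the defense names are placeholders, not real products.
EXAMPLE_DEFENSES = {
    "prompt-injection filter": {Target.PERCEPTION},
    "tool-call allowlist": {Target.ACTION},
}

for target in uncovered_targets(EXAMPLE_DEFENSES):
    print(f"no defense mapped to {target.name} ({target.value})")
```

With only an injection filter and an action allowlist in the inventory, four of the six targets come back uncovered, which is the gap the taxonomy is meant to expose. A mapped defense is a claim of coverage, not proof that it works.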
