What Anthropic Learned Building a Multi-Agent Researcher
Anthropic's engineering team described the production system behind its Research feature, and the useful part is the numbers, not the architecture diagram. The setup is orchestrator-worker: a Claude Opus 4 lead agent plans the work and spawns Claude Sonnet 4 subagents that execute in parallel. On Anthropic's internal research evaluation, this beat a single Opus 4 agent by 90.2 percent.

It is not free. The team reports that agents already use about four times the tokens of a chat interaction, that multi-agent runs use about fifteen times, and that token usage alone explains roughly 80 percent of the performance variance on these browsing tasks. The decision rule is therefore economic: multi-agent earns its cost on high-value tasks that parallelize, and is wasteful otherwise.

Two other concrete findings: having Claude rewrite its own tool descriptions cut task completion time by 40 percent for later agents, and evaluation uses an LLM-as-judge rubric scoring factual accuracy, citation accuracy, completeness, source quality, and tool efficiency. The team is candid that the last mile, durable execution and deploying updates without disrupting running agents, was most of the work.
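The orchestrator-worker shape is easy to picture in code. This is a minimal sketch, not Anthropic's implementation: `call_model` is a hypothetical stub standing in for real API calls, and the three-way task split stands in for the lead agent's actual planning step.

```python
import asyncio

async def call_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real model call; names are illustrative."""
    await asyncio.sleep(0)  # placeholder for network latency
    return f"[{model}] findings for: {prompt}"

async def run_research(question: str) -> str:
    # Lead agent (Opus-class) decomposes the question into subtasks.
    await call_model("lead", f"Decompose into subtasks: {question}")
    subtasks = [f"{question} / aspect {i}" for i in range(3)]  # stand-in for the parsed plan

    # Worker subagents (Sonnet-class) run on the subtasks in parallel.
    results = await asyncio.gather(
        *(call_model("worker", task) for task in subtasks)
    )

    # Lead agent synthesizes the parallel findings into one answer.
    return await call_model("lead", "Synthesize: " + " | ".join(results))

answer = asyncio.run(run_research("topic X"))
print(answer)
```

The fan-out via `asyncio.gather` is where the speedup comes from, and also where the token bill multiplies: every subagent carries its own context and tool calls.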
Why it matters
If you are weighing a multi-agent design, this gives you a real cost multiple and a threshold instead of vibes. The 15x token figure and the 80 percent variance result are the things to put in your own estimate before you build.
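That estimate can be a few lines of arithmetic. The 4x and 15x multiples below are from the article; the chat token count, price per million tokens, and task values are illustrative assumptions you would replace with your own.

```python
# Back-of-envelope cost comparison using the article's token multiples.
CHAT_TOKENS = 5_000      # tokens in a typical chat interaction (assumed)
AGENT_MULT = 4           # single agent ~4x chat tokens (from the article)
MULTI_MULT = 15          # multi-agent ~15x chat tokens (from the article)
PRICE_PER_MTOK = 15.0    # dollars per million tokens (illustrative)

def run_cost(multiplier: int) -> float:
    """Dollar cost of one run at the given token multiple of chat."""
    return CHAT_TOKENS * multiplier * PRICE_PER_MTOK / 1_000_000

single_cost = run_cost(AGENT_MULT)   # 20,000 tokens -> $0.30
multi_cost = run_cost(MULTI_MULT)    # 75,000 tokens -> $1.125

def use_multi_agent(task_value: float, parallelizable: bool) -> bool:
    """The article's decision rule: pay the multiple only when the task
    is valuable enough and actually decomposes into parallel work."""
    return parallelizable and task_value > multi_cost
```

On these assumed numbers a multi-agent run costs nearly 4x a single-agent run, so the question is whether the task's value clears that bar, not whether more agents score better.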