DeepSeek-V4 spends most of its design budget making long context usable

AI · 3 weeks ago · source (huggingface.co)

DeepSeek's release writeup makes an argument most context-window announcements skip: a million tokens is capacity, and the hard part is making that capacity cheap enough that an agent can actually use it across a long tool-use run. V4 comes in two mixture-of-experts sizes: Pro, at 1.6 trillion total and 49 billion active parameters, and Flash, at 284 billion total and 13 billion active, both with a 1M-token context window.
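
As a rough sense of what those active-parameter counts mean for compute, here is a back-of-envelope sketch assuming the standard ~2 FLOPs per active parameter per generated token rule of thumb. It deliberately ignores attention and KV-cache costs, which are exactly what dominate at 1M tokens and what the figures below address.

```python
# Back-of-envelope: dense matmul cost per generated token in an MoE model
# scales with *active* parameters, not total. Rule of thumb: ~2 FLOPs per
# active parameter per token (one multiply-add per weight in the forward pass).
# Ignores attention FLOPs and KV-cache traffic, which dominate at 1M context.

def matmul_flops_per_token(active_params: float) -> float:
    return 2 * active_params

pro_active = 49e9     # V4-Pro: 49B active (of 1.6T total)
flash_active = 13e9   # V4-Flash: 13B active (of 284B total)

print(f"V4-Pro:   ~{matmul_flops_per_token(pro_active) / 1e9:.0f} GFLOPs/token")
print(f"V4-Flash: ~{matmul_flops_per_token(flash_active) / 1e9:.0f} GFLOPs/token")
```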

The interesting numbers are about cost, not size. At a 1M-token context, V4-Pro runs single-token inference at 27 percent of the previous generation's FLOPs and with about 10 percent of its KV cache; Flash's cache is down to 7 percent of the previous generation's, roughly 2 percent of what standard grouped-query attention would need. That comes from interleaving two compression schemes, a 4x-compressed sparse attention and a 128x heavily compressed attention, with FP8 storage for cache entries. There is also an agent-specific choice: reasoning traces are kept across message boundaries when tool calls are present, so the chain of thought accumulates over a multi-step task instead of resetting each turn. On benchmarks it sits near the frontier: 80.6 percent resolved on SWE-bench Verified against Opus 4.6's 80.8.
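
A minimal sketch of where cache savings on that order come from, using hypothetical layer and head dimensions (none of these are published V4 values) and modeling the writeup's scheme as interleaving a 4x-compressed cache with a 128x-compressed one, both stored in FP8:

```python
# Hypothetical sketch of KV-cache size at 1M tokens. All dimensions below are
# illustrative assumptions, not published DeepSeek-V4 values; the point is how
# mixing 4x- and 128x-compressed layers in FP8 shrinks the cache versus GQA.

CTX = 1_000_000   # tokens in context
LAYERS = 60       # assumed layer count
KV_HEADS = 8      # assumed GQA key-value heads
HEAD_DIM = 128    # assumed head dimension

def gqa_cache_bytes(bytes_per_elem: float) -> float:
    # Keys and values (factor of 2) for every layer and token.
    return CTX * LAYERS * KV_HEADS * HEAD_DIM * 2 * bytes_per_elem

def mixed_cache_bytes(frac_sparse: float) -> float:
    # Interleave: a fraction of layers keep a 4x-compressed cache, the rest a
    # 128x-compressed one; entries stored in FP8 (1 byte) instead of FP16.
    base = gqa_cache_bytes(bytes_per_elem=1.0)  # FP8 storage
    return base * (frac_sparse / 4 + (1 - frac_sparse) / 128)

gqa = gqa_cache_bytes(bytes_per_elem=2.0)      # FP16 GQA baseline
mixed = mixed_cache_bytes(frac_sparse=0.25)    # assumed 1-in-4 layers sparse

print(f"GQA FP16 cache at 1M tokens: {gqa / 1e9:.0f} GB")
print(f"Mixed-compression FP8 cache: {mixed / 1e9:.1f} GB ({mixed / gqa:.1%} of GQA)")
```

With these made-up dimensions the mix lands in the low single digits of a percent of the GQA baseline, the same ballpark as the figures above.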
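
The trace-retention choice is a history policy rather than an architecture change. A minimal sketch of how a client-side history builder might apply such a rule, with hypothetical message fields (role, reasoning, tool_calls) that are not DeepSeek's actual API schema:

```python
# Hypothetical sketch of the policy described above: keep reasoning traces
# across turns while a tool-use run is in progress, drop them otherwise.
# The message fields here are illustrative, not a real DeepSeek API schema.

def build_history(messages: list[dict]) -> list[dict]:
    history = []
    for msg in messages:
        kept = dict(msg)
        # Strip the reasoning trace unless this assistant turn issued tool
        # calls, in which case the trace carries over so the chain of thought
        # accumulates across the multi-step task instead of resetting.
        if msg.get("role") == "assistant" and not msg.get("tool_calls"):
            kept.pop("reasoning", None)
        history.append(kept)
    return history
```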

Why it matters

If you build long-running agents, the cache and FLOPs figures are the spec to read: they decide whether a million-token window is affordable in production, and DeepSeek is shipping the recipe as open weights.

Open Models · LLM Architecture