How 2026 open models buy long-context efficiency without shrinking
Sebastian Raschka's latest Ahead of AI issue walks through how 2026 open-weight models gain long-context efficiency through targeted architecture changes rather than by shrinking. The writeup is dense, but a few mechanisms are worth knowing.
Gemma 4 reuses key-value projections from earlier layers in later ones, cutting the KV cache roughly in half, which works out to about 6 GB saved for the E4B model at 128K context (sketched below). It also uses per-layer embeddings, so the E2B variant counts 2.3B "effective" parameters against 5.1B total while keeping the transformer's compute footprint small.

DeepSeek V4 is the bigger story. Its manifold-constrained hyper-connections replace the single residual stream with parallel streams for only 6.7% extra training time, and its compressed-attention scheme (CSA, plus a heavier HCA) cuts long-context cost sharply: at 1M tokens the Pro variant uses 27% of the inference FLOPs and 10% of the KV cache of V3.2. Raschka is clear about the tradeoff: heavier sequence compression risks quality, which is why these models alternate mechanisms and keep sliding-window branches.
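To make the Gemma-style reuse concrete, here is a minimal PyTorch sketch of cross-layer KV sharing: only designated producer layers project and cache K/V, and later consumer layers read those tensors instead of computing their own. The class name, dimensions, and sharing pattern are illustrative assumptions, not Gemma 4's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKVAttention(nn.Module):
    """Self-attention layer that can reuse K/V computed by an earlier layer.

    Producer layers (reuse_kv=False) project and cache their own K/V.
    Consumer layers (reuse_kv=True) skip K/V projection entirely and read
    the producer's tensors, so they add nothing to the KV cache.
    """

    def __init__(self, d_model: int, n_heads: int, reuse_kv: bool = False):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.reuse_kv = reuse_kv
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)
        if not reuse_kv:
            self.k_proj = nn.Linear(d_model, d_model, bias=False)
            self.v_proj = nn.Linear(d_model, d_model, bias=False)

    def _split_heads(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        return x.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor, shared_kv=None):
        q = self._split_heads(self.q_proj(x))
        if self.reuse_kv:
            k, v = shared_kv  # reuse the earlier layer's projections
        else:
            k, v = self._split_heads(self.k_proj(x)), self._split_heads(self.v_proj(x))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(x.shape)
        return self.out_proj(out), (k, v)

# Usage: the producer's (k, v) is handed to a consumer layer later in the stack.
producer = SharedKVAttention(d_model=256, n_heads=8)
consumer = SharedKVAttention(d_model=256, n_heads=8, reuse_kv=True)
x = torch.randn(1, 16, 256)
h, kv = producer(x)
h, _ = consumer(h, shared_kv=kv)
```

In a stack built from these layers, making roughly half of them consumers is what halves the KV cache, since only producer layers contribute cached tensors.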
The throughline is that the easy gains from scale are giving way to careful, model-specific engineering. Read the full breakdown on Ahead of AI.
Why it matters
If you serve or fine-tune open models, these are the knobs that decide your memory bill at long context. The DeepSeek V4 numbers in particular are concrete enough to estimate against your own deployment before the next model forces the choice for you.
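If you want rough numbers for your own setup, a back-of-the-envelope estimate like the sketch below is enough; the layer count, KV-head count, head dimension, and bf16 cache here are hypothetical placeholders, not the configuration of any model in the article.

```python
def kv_cache_gib(n_kv_layers: int, n_kv_heads: int, d_head: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size in GiB for one sequence: K and V, per layer, per head, per token."""
    return 2 * n_kv_layers * n_kv_heads * d_head * context_len * bytes_per_elem / 1024**3

# Hypothetical mid-size model: 32 layers, 8 KV heads of dim 128, bf16 cache, 128K context.
full = kv_cache_gib(32, 8, 128, 128_000)     # every layer caches its own K/V
shared = kv_cache_gib(16, 8, 128, 128_000)   # half the layers reuse earlier K/V
print(f"full: {full:.1f} GiB, with cross-layer sharing: {shared:.1f} GiB")
```

Swap in the real dimensions of whatever model you serve and the context lengths you actually see, and the same arithmetic tells you what any of these cache-sharing or compression schemes would save you.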