A field guide to the attention variants modern LLMs actually use
Sebastian Raschka's visual guide is useful because it ties each attention variant to a real tradeoff and a model that ships it. Plain multi-head attention is the GPT-2 and OLMo baseline: accurate, but every head caches its own keys and values, so the KV cache grows fast with context length. Grouped-query attention, used in Llama 3, Qwen3, and Gemma 3, shares each key-value head across a group of query heads to cut that memory with little quality loss. Multi-head latent attention, in DeepSeek V3, Kimi K2, and GLM-5, instead compresses what gets cached into a smaller latent representation and holds quality better past 100B parameters.
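As a rough illustration (not code from the guide), here is a minimal PyTorch sketch of the grouped-query idea: only a few key/value heads are projected and cached, and each one is reused by a group of query heads, so the KV cache shrinks by the query-to-KV head ratio. The head counts and dimensions are made up for the example.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Sketch of GQA. x: (batch, seq, d_model); wq/wk/wv: projection matrices."""
    b, t, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    q = (x @ wq).view(b, t, n_q_heads, head_dim).transpose(1, 2)
    # Only n_kv_heads key/value heads are projected; these are what a KV cache stores.
    k = (x @ wk).view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    v = (x @ wv).view(b, t, n_kv_heads, head_dim).transpose(1, 2)
    # Each KV head is shared by a group of n_q_heads // n_kv_heads query heads.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, t, n_q_heads * head_dim)

# Example: 8 query heads sharing 2 KV heads -> a 4x smaller KV cache.
d_model, n_q, n_kv = 64, 8, 2
head_dim = d_model // n_q
x = torch.randn(1, 16, d_model)
wq = torch.randn(d_model, n_q * head_dim)
wk = torch.randn(d_model, n_kv * head_dim)
wv = torch.randn(d_model, n_kv * head_dim)
print(grouped_query_attention(x, wq, wk, wv, n_q, n_kv).shape)  # (1, 16, 64)
```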
The rest of the catalog covers the long-context fight. Sliding window attention, used by Gemma 3 at a 5:1 local-to-global ratio, limits each token to recent context for large efficiency gains and small quality loss. DeepSeek Sparse Attention learns which past tokens matter rather than using a fixed window. Hybrid designs like Qwen3-Next and Kimi Linear swap most attention layers for cheaper linear or state-space blocks and keep a few full-attention layers for retrieval. Read the full guide on Ahead of AI.
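For the sliding-window case, a minimal sketch of the mask is enough to see where the savings come from (the window size below is arbitrary, not Gemma 3's): each query attends only to itself and the previous few tokens, so per-token attention cost stops growing with total context length in the local layers.

```python
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window):
    """Sketch of sliding-window attention. q, k, v: (batch, heads, seq, head_dim)."""
    t = q.shape[-2]
    pos = torch.arange(t)
    # Allow a key only if it is at or before the query and within the window.
    mask = (pos[None, :] <= pos[:, None]) & (pos[:, None] - pos[None, :] < window)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

q = k = v = torch.randn(1, 4, 32, 16)
print(sliding_window_attention(q, k, v, window=8).shape)  # (1, 4, 32, 16)
```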
Why it matters
If you pick or fine-tune open models, this is the decoder ring for the spec sheets. Knowing that a model uses MLA versus sliding window tells you more about its long-context cost and quality than its parameter count does.