Sebastian Raschka's map of inference-time scaling for reasoning
Sebastian Raschka's piece is a map, not a manifesto. It sorts the training-free ways to trade extra compute at inference for better answers into six categories: chain-of-thought prompting, self-consistency, best-of-N ranking, rejection sampling with a verifier, self-refinement, and search over solution paths. The point that ties them together is simple: none of these techniques touches the model weights; they simply spend more time and tokens when the question is hard.
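To make one of those categories concrete, here is a minimal sketch of self-consistency: sample several answers to the same prompt and keep the majority vote. The `fake_model` stand-in and its canned answers are hypothetical placeholders for a real sampled LLM call, not anything from Raschka's article.

```python
from collections import Counter
from itertools import cycle

def self_consistency(sample_answer, prompt, n=5):
    """Sample n answers for the same prompt and return the majority vote.

    In practice, sample_answer would call an LLM with temperature > 0 so
    each draw can follow a different chain of thought.
    """
    answers = [sample_answer(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stand-in for a sampled model: 3 of every 5 draws agree.
_draws = cycle(["42", "41", "42", "6x7", "42"])
def fake_model(prompt):
    return next(_draws)

print(self_consistency(fake_model, "What is 6 * 7?", n=5))  # → 42
```

The same skeleton extends to best-of-N ranking or rejection sampling by swapping the majority vote for a scoring or verifier step over the sampled answers.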
The concrete anchor is a worked example in which these techniques lift a base model from roughly 15% to 52% accuracy, a result reached by tuning across thousands of runs. Raschka situates the trend around OpenAI's o1, the model that made the approach mainstream, and notes that the major providers now use some form of it. The article is a survey of recent literature rather than a look inside any vendor's implementation, and it is honest about that scope.
Read the full breakdown on Ahead of AI.
Why it matters
If you are deciding where to spend a compute budget, this is the vocabulary to reason with. The 15% to 52% jump shows the upside is large, but each category has a different cost and failure mode, so naming them is the first step to picking one deliberately rather than by default.