OpenAI o1 and the Start of Test-Time Reasoning
OpenAI introduced o1 in September 2024 as a model trained with reinforcement learning to produce an internal chain of thought before it answers. The claim that mattered was about scaling: OpenAI reported that o1's performance improves both with more reinforcement learning during training and with more time spent thinking at inference, which is a different lever from making the base model bigger. The benchmark numbers OpenAI reported were strong for the time: o1 placed among roughly the top 500 students on the AIME math qualifier, ranked in the 89th percentile on Codeforces competitive programming, and exceeded the accuracy of PhD-level human experts on GPQA, a graduate-level science benchmark. The model was slower and more expensive per answer, and OpenAI was reasonably direct that the gains come with a latency cost. Within months, DeepSeek, Qwen, and Google shipped their own reasoning models chasing the same idea.
Why it matters
This release opened the reasoning-model era and the practice of spending more compute at inference instead of only at training. If you choose models or design systems, test-time reasoning is now a knob you budget for, and o1 is where that tradeoff entered production. The pattern, not the specific model, is the thing to understand.
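o1's internal chain of thought is not something the announcement lets you reimplement, but the lever itself, buying accuracy with more inference compute, can be sketched with the simplest public version of the idea: sample several answers and take a majority vote (self-consistency). Everything below is a hypothetical toy, including the stand-in solver and its 60% accuracy; it illustrates the tradeoff, not o1's actual mechanism.

```python
import random
from collections import Counter

def sample_answer(rng: random.Random) -> int:
    """Stand-in for one stochastic model sample: returns the correct
    answer (42) with probability 0.6, otherwise a random wrong one.
    Purely illustrative -- not how o1 works internally."""
    if rng.random() < 0.6:
        return 42
    return rng.choice([7, 13, 99])

def majority_vote(n_samples: int, seed: int = 0) -> int:
    """Spend more inference compute: draw n_samples answers and
    return the most common one."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

def accuracy(n_samples: int, trials: int = 500) -> float:
    """Empirical accuracy of the voted answer across many trials."""
    hits = sum(majority_vote(n_samples, seed=t) == 42 for t in range(trials))
    return hits / trials
```

With one sample per question, accuracy sits near the toy solver's base rate of 0.6; with fifteen samples, the vote is almost always right. That gap, paid for in tokens and latency, is the knob the reasoning-model era put in front of system designers.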