Tülu 3 Opens Up the Post-Training Recipe
Most of what turns a raw pretrained model into something useful happens in post-training, and most labs keep that part closed. Ai2's Tülu 3, released in November 2024, opened it up. The family ships at 8B and 70B parameters, along with all of the data, the data mixes, the training code, the infrastructure, and an evaluation framework. The recipe combines three stages: supervised fine-tuning on curated and synthetic instruction data, on-policy preference learning in the DPO family, and a reinforcement learning stage Ai2 calls reinforcement learning with verifiable rewards (RLVR). That last piece is notable as an early public formalization of training against automatically checkable rewards rather than a learned reward model, applied to tasks like math and code where an answer can be verified programmatically. Ai2 also released decontaminated skill datasets and intermediate checkpoints so each stage can be studied separately. You can read the full writeup on the Ai2 blog.
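To make the RLVR idea concrete, here is a minimal sketch of a verifiable reward for a math-style task: a deterministic check of the model's final answer against a reference, returning a binary score that a policy-gradient trainer can consume in place of a reward model's output. This is illustrative only, not Ai2's implementation; the "Answer:" extraction convention and the normalize helper are assumptions made for the example.

```python
import re

def normalize(answer: str) -> str:
    # Crude canonicalization so "42." and "42" compare equal.
    # (Illustrative; a real verifier would do more robust parsing.)
    return answer.rstrip(".").replace(",", "").strip().lower()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the completion's final answer matches the
    reference, else 0.0. No learned reward model is involved; the
    reward is a deterministic, automatically checkable comparison."""
    # Assumed output convention for this sketch: the completion ends
    # with a line like "Answer: 42".
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    return 1.0 if normalize(match.group(1)) == normalize(ground_truth) else 0.0

# Example: a correct chain of reasoning earns reward 1.0, anything else 0.0.
print(verifiable_reward("6 * 7 = 42, so the total is 42.\nAnswer: 42", "42"))  # 1.0
print(verifiable_reward("I think it's 41.\nAnswer: 41", "42"))                 # 0.0
```

The point of the design is that the reward signal cannot be gamed the way a learned reward model can: the model either produces a verifiably correct answer or it gets nothing.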
Why it matters
Reinforcement learning with verifiable rewards became central to the reasoning-model wave that followed, and Tülu 3 is one of the first places the method and its data were laid out in the open. If you do post-training, this is a working recipe to compare against instead of guessing what the closed labs do.