Fix the inference engine before you patch the RL objective
ServiceNow's writeup is a careful debugging story with a sharp lesson. Moving their PipelineRL setup from vLLM 0.8.5 to 0.18.1 broke training: clip rate, KL divergence, entropy, and reward all drifted from the known-good reference within the first few steps. The cause was not the algorithm but the logprobs the inference engine returned, which the trainer uses to compute policy ratios.
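To make the failure mode concrete, here is a minimal sketch (not ServiceNow's code; the function name and signature are illustrative) of how engine-returned logprobs feed the trainer. The policy ratio is exp(trainer logprob minus engine logprob) per sampled token, so any systematic shift in what the engine reports moves the clip rate and KL from the very first step, even when the weights are identical.

```python
import torch

def ppo_ratio_stats(trainer_logprobs: torch.Tensor,
                    engine_logprobs: torch.Tensor,
                    clip_eps: float = 0.2):
    """Per-token logprobs for the same sampled tokens, from the trainer and the engine."""
    ratio = torch.exp(trainer_logprobs - engine_logprobs)
    clip_rate = ((ratio < 1 - clip_eps) | (ratio > 1 + clip_eps)).float().mean()
    # Crude per-token estimate of KL(behaviour policy || current policy).
    approx_kl = (engine_logprobs - trainer_logprobs).mean()
    return ratio, clip_rate, approx_kl

# With a healthy backend, ratios start near 1.0 and clip_rate near 0; a logprob
# mismatch (raw vs. processed, precision drift) inflates both immediately.
```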
They found four separate issues. The new version returned raw logprobs, taken before temperature and top-k/top-p processing, while the trainer expected logprobs for the distribution the tokens were actually sampled from. Some runtime defaults silently changed. Cache invalidation on in-flight weight updates behaved differently. And the final projection layer ran below fp32, which shifts the logits enough to show up once they propagate into policy ratios and KL. With all four fixed, including forcing fp32 on the lm_head and requesting processed rather than raw logprobs, the upgraded run tracked the reference again. The takeaway they draw is the title: fix inference-side correctness before adding objective-side compensations like importance reweighting, because those patches can hide a broken backend and make training dynamics impossible to read.
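The raw-versus-processed distinction is easy to picture in code. The sketch below is an illustration under stated assumptions, not the vLLM or PipelineRL interface: it shows temperature and top-p only (top-k is analogous), and the `model.lm_head` cast at the end assumes a Hugging Face-style model layout.

```python
import torch
import torch.nn.functional as F

def processed_logprobs(logits: torch.Tensor, temperature: float, top_p: float) -> torch.Tensor:
    """Logprobs of the distribution the tokens were actually sampled from."""
    logits = logits.float() / temperature                      # fp32 logits, temperature applied
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Zero out everything outside the top-p nucleus (the top token always survives).
    sorted_probs[cumulative - sorted_probs > top_p] = 0.0
    probs = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    return torch.log(probs / probs.sum(dim=-1, keepdim=True))  # -inf for masked tokens

# Forcing the projection layer itself to fp32 (assumed Hugging Face-style layout):
# model.lm_head = model.lm_head.to(torch.float32)
```

Raw logprobs skip the temperature scaling and nucleus masking entirely, so the trainer ends up taking ratios against a distribution no token was ever drawn from.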
Why it matters
If you run online RL, this is a concrete reminder that an engine upgrade is part of your correctness surface: numerical precision in logprobs is not a performance detail, and objective hacks layered on a broken backend will mislead you.
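One way to treat an upgrade as part of that surface (a hedged sketch, not a prescribed workflow; the function name is hypothetical) is a regression gate: before switching training over, replay a frozen prompt set through the new engine with identical weights and compare its per-token logprobs against the known-good reference.

```python
import torch

def gate_engine_upgrade(reference_logprobs: torch.Tensor,
                        candidate_logprobs: torch.Tensor,
                        atol: float = 1e-2) -> None:
    """Per-token logprobs for the same prompts, tokens, and weights from both engine versions."""
    diff = (reference_logprobs - candidate_logprobs).abs()
    print(f"max |delta logprob| = {diff.max().item():.4f}, mean = {diff.mean().item():.4f}")
    if diff.max().item() > atol:
        raise RuntimeError("new engine disagrees with the known-good reference; debug before training")
```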