
Hugging Face cut 22 percent off generation time by overlapping CPU and GPU

Engineering · 3 days ago · source (huggingface.co)

The Hugging Face transformers team published a walkthrough of a problem that is easy to ignore until you measure it. In standard continuous batching, the CPU prepares the next batch while the GPU sits idle, then the GPU computes while the CPU waits. The two never overlap, and in their setup that wasted about 24 percent of generation time with the GPU doing nothing.
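That serial alternation can be illustrated with a toy stand-in. The sketch below simulates the pattern with `time.sleep` in place of real work; the durations are made-up illustrative numbers, not measurements from the post, but they show how back-to-back CPU and GPU phases cap GPU utilization well below 100 percent.

```python
import time

# Toy simulation of serial continuous batching: the CPU prepares a batch,
# then the "GPU" computes it, and the two phases never overlap.
# Durations are illustrative, not measured.
CPU_PREP_S = 0.02   # host-side batch preparation (tokenization, padding, copies)
GPU_STEP_S = 0.06   # device-side forward pass
STEPS = 10

def serial_generation(steps=STEPS):
    start = time.perf_counter()
    gpu_busy = 0.0
    for _ in range(steps):
        time.sleep(CPU_PREP_S)      # GPU sits idle here
        t = time.perf_counter()
        time.sleep(GPU_STEP_S)      # CPU sits idle here
        gpu_busy += time.perf_counter() - t
    total = time.perf_counter() - start
    return total, gpu_busy / total  # total wall time, GPU utilization

total, util = serial_generation()
print(f"total={total:.2f}s  gpu_utilization={util:.0%}")
```

With these toy numbers the GPU is busy roughly 75 percent of the time, in the same ballpark as the 76 percent utilization the post reports for the serial baseline.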

The fix is asynchronous batching, and the post is worth reading because it shows the mechanics rather than just the idea: separate CUDA streams for host-to-device transfer, compute, and device-to-host transfer; CUDA events to order work across streams without blocking the CPU; and double-buffered input and output slots so the CPU can fill batch N+1 while the GPU runs batch N, sharing one CUDA graph memory pool so VRAM does not double. On an 8B model generating 8K tokens at batch size 32, total time dropped from 300.6 to 234.5 seconds, a 22 percent speedup, with GPU utilization rising from 76 to 99.4 percent. At 5 dollars an hour for an H200, that is roughly 0.59 dollars saved per hour of generation.
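The scheduling idea behind the fix can be sketched without a GPU at all. The toy below uses a worker thread as a stand-in for the GPU compute stream and `threading.Event` objects in the role CUDA events play in the post (ordering work between producer and consumer without blocking either): two slots are double-buffered so the CPU fills batch N+1 while the worker runs batch N. This is an analogy for the pattern, not the transformers implementation, and the durations are made-up.

```python
import threading
import time

CPU_PREP_S = 0.02   # host-side batch preparation (illustrative)
GPU_STEP_S = 0.06   # device-side forward pass (illustrative)
STEPS = 10

def overlapped_generation(steps=STEPS):
    # Two input slots: the CPU fills slot (n+1) % 2 while the worker
    # (standing in for the GPU compute stream) runs slot n % 2.
    slots = [None, None]
    ready = [threading.Event(), threading.Event()]  # slot filled, compute may start
    free = [threading.Event(), threading.Event()]   # compute done, slot may be refilled
    for e in free:
        e.set()                      # both slots start empty
    gpu_busy = [0.0]

    def gpu_worker():
        for n in range(steps):
            s = n % 2
            ready[s].wait()          # like waiting on the H2D-copy event
            ready[s].clear()
            t = time.perf_counter()
            time.sleep(GPU_STEP_S)   # "forward pass" on slot s
            gpu_busy[0] += time.perf_counter() - t
            free[s].set()            # record "compute done" for this slot

    start = time.perf_counter()
    worker = threading.Thread(target=gpu_worker)
    worker.start()
    for n in range(steps):
        s = n % 2
        free[s].wait()               # don't overwrite a slot still being computed
        free[s].clear()
        time.sleep(CPU_PREP_S)       # prepare batch n, overlapping the other slot's compute
        slots[s] = f"batch-{n}"
        ready[s].set()
    worker.join()
    total = time.perf_counter() - start
    return total, gpu_busy[0] / total

total, util = overlapped_generation()
print(f"total={total:.2f}s  gpu_utilization={util:.0%}")
```

After the first batch's preparation, the worker runs back to back: total time approaches prep-once plus steps times compute, and utilization climbs toward 100 percent, which is the same shape as the post's 76 to 99.4 percent jump. The real implementation adds the pieces a toy can skip: a third stream for device-to-host transfer, and a shared CUDA graph memory pool so the two buffers do not double VRAM.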

Why it matters

If you run your own inference, this is a concrete, code-level pattern you can copy: the streams-and-double-buffering recipe recovers near-idle GPU time on any workload with alternating CPU and GPU phases, not just token generation.

Inference · Hugging Face