
Decoupled DiLoCo trains a 12B model across four regions, 20x faster

Engineering · 3 weeks ago · source (deepmind.google)

Large-model training normally needs identical chips kept in near-perfect lockstep, which gets harder the more chips you add. Decoupled DiLoCo breaks that requirement. It splits training into independent islands of compute that exchange data asynchronously, so a failure in one island does not stall the rest. It builds on two earlier ideas: Pathways, for asynchronous distribution, and the original DiLoCo, for training over low inter-datacenter bandwidth.
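The core DiLoCo idea behind those islands can be sketched in a few lines: each island runs many cheap local optimizer steps, and only a parameter delta (a "pseudo-gradient") crosses the slow link, once per outer round. The sketch below is a minimal, synchronous toy version on a quadratic loss, not DeepMind's code; all names and the loss are illustrative assumptions, and the real system adds asynchrony, failure tolerance, and real models.

```python
import random

# Toy DiLoCo-style training sketch (hypothetical; not DeepMind's implementation).
# Each "island" takes INNER_STEPS local gradient steps, then ships only the
# parameter delta over the (slow) cross-datacenter link.
random.seed(0)
DIM, ISLANDS, INNER_STEPS, OUTER_ROUNDS = 4, 3, 50, 10

# Each island's local objective: a quadratic pulling toward its own target,
# standing in for that island's shard of the training data.
TARGETS = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(ISLANDS)]

def inner_train(theta, target, lr=0.1):
    """Run local SGD on ||w - target||^2; return the pseudo-gradient delta."""
    w = list(theta)
    for _ in range(INNER_STEPS):
        w = [wi - lr * 2 * (wi - ti) for wi, ti in zip(w, target)]
    # Only this delta crosses the inter-datacenter link, not every gradient.
    return [ti - wi for ti, wi in zip(theta, w)]

theta = [0.0] * DIM  # shared outer parameters
for _ in range(OUTER_ROUNDS):
    deltas = [inner_train(theta, t) for t in TARGETS]  # islands work independently
    # Outer step: apply the averaged pseudo-gradients to the shared parameters.
    avg = [sum(d[i] for d in deltas) / ISLANDS for i in range(DIM)]
    theta = [th - a for th, a in zip(theta, avg)]

print(theta)  # converges toward the mean of the islands' targets
```

Because communication happens once per outer round instead of once per gradient step, the bandwidth requirement drops by roughly a factor of INNER_STEPS, which is the mechanism behind the bandwidth numbers quoted below.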

The numbers are the argument. Cross-datacenter bandwidth drops from 198 Gbps to 0.84 Gbps across eight datacenters, roughly a 235x reduction. Under high failure rates, useful throughput holds at 88% where standard methods fall to 27%. DeepMind trained a 12-billion-parameter model across four US regions about 20x faster than the conventional approach, and reports benchmark quality matching the baseline despite the architectural change. The result is training over internet-grade links and mixed hardware generations instead of one custom-built cluster. Read the full post on DeepMind's blog.

Why it matters

If compute access is your bottleneck, this loosens the constraint that training needs one giant tightly-coupled cluster. The 88% versus 27% throughput-under-failure gap is the number to track, because it decides whether distributed, heterogeneous training is merely possible or actually practical.

Training · Google DeepMind