Decoupled DiLoCo trains a 12B model across four regions, 20x faster
Large-model training normally needs identical chips kept in near-perfect lockstep, which gets harder the more chips you add. Decoupled DiLoCo breaks that requirement. It splits training into independent islands of compute that exchange data asynchronously, so a failure in one island does not stall the rest. It builds on two earlier ideas: Pathways for asynchronous distribution, and the original DiLoCo for training over low inter-datacenter bandwidth.
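The mechanics are easiest to see in miniature. The sketch below is not DeepMind's code; it is a toy DiLoCo-style inner/outer loop on a synthetic regression task, with the island count, sync interval H, and learning rates invented for illustration. Only the loop structure follows the published DiLoCo recipe: each island runs many local optimizer steps on its own data shard, ships back a single parameter-sized delta, and an outer optimizer with Nesterov momentum folds the islands' deltas into the shared parameters.

```python
# Toy DiLoCo-style inner/outer loop (illustrative only, not DeepMind's code).
# Assumptions: a synthetic least-squares task, plain SGD as the inner
# optimizer, and outer Nesterov momentum on averaged parameter deltas
# ("pseudo-gradients"). Island count, H, and learning rates are made up.
import numpy as np

rng = np.random.default_rng(0)
d, n_islands, H, outer_rounds = 8, 4, 20, 50
inner_lr, outer_lr, beta = 0.05, 0.7, 0.9

# Synthetic regression data; each island keeps its own shard locally.
w_true = rng.normal(size=d)
shards = []
for _ in range(n_islands):
    X = rng.normal(size=(256, d))
    y = X @ w_true + 0.1 * rng.normal(size=256)
    shards.append((X, y))

theta = np.zeros(d)      # shared ("outer") parameters
momentum = np.zeros(d)   # outer optimizer state

def inner_steps(w, X, y):
    """Run H local SGD steps on one island's shard; no cross-island traffic."""
    for _ in range(H):
        idx = rng.choice(len(y), size=32, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w = w - inner_lr * grad
    return w

for _ in range(outer_rounds):
    # Each island trains independently from the same starting point, then
    # communicates only a parameter-sized delta, once every H steps.
    deltas = [inner_steps(theta.copy(), X, y) - theta for X, y in shards]
    pseudo_grad = -np.mean(deltas, axis=0)

    # Outer update: Nesterov momentum applied to the averaged pseudo-gradient.
    momentum = beta * momentum + pseudo_grad
    theta = theta - outer_lr * (beta * momentum + pseudo_grad)

mse = np.mean([np.mean((X @ theta - y) ** 2) for X, y in shards])
print(f"final mean-squared error across shards: {mse:.4f}")
```

The structural point is that cross-island traffic is one parameter-sized tensor every H steps rather than one gradient every step, which is what lets the bandwidth requirement collapse.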
The numbers are the argument. Cross-datacenter bandwidth drops from 198 Gbps to 0.84 Gbps across eight datacenters, roughly a 235x reduction. Under high failure rates, useful throughput holds at 88% where standard methods fall to 27%. DeepMind reports training a 12-billion-parameter model across four US regions about 20x faster than the conventional approach, with benchmark quality matching the baseline despite the architectural change. The result is training over internet-grade links and mixed hardware generations instead of one custom-built cluster. Read the full post on DeepMind's blog.
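The failure-tolerance half has a simple intuition behind it. The toy model below is my own illustration, not DeepMind's analysis: it assumes a lockstep step completes only when every one of N workers is up, so expected useful throughput shrinks roughly like (1 - p)^N, while independent islands lose only the work of whichever island actually failed. The worker count and failure probabilities are invented; the 88% and 27% figures above are DeepMind's measurements, not outputs of this model.

```python
# Back-of-envelope failure model (an assumption for illustration, not
# DeepMind's methodology). Lockstep training makes progress only when all
# N workers are up, so throughput ~ (1 - p) ** N for per-step failure
# probability p per worker; decoupled islands idle only the failed island,
# so throughput ~ (1 - p) regardless of scale.
def lockstep_throughput(p: float, n_workers: int) -> float:
    return (1 - p) ** n_workers

def decoupled_throughput(p: float) -> float:
    return 1 - p

for p in (0.001, 0.005, 0.01):
    print(f"p={p:.3f}  lockstep(512 workers)={lockstep_throughput(p, 512):.2f}"
          f"  decoupled={decoupled_throughput(p):.2f}")
```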
Why it matters
If compute access is your bottleneck, this loosens the constraint that training needs one giant, tightly coupled cluster. The 88% versus 27% throughput-under-failure gap is the number to track, because it decides whether distributed, heterogeneous training is merely possible or actually practical.