Streaming DiLoCo

February 13, 2025

note-to-self

[2501.18512] Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch

Every paper in this series has been required reading for (very) large language model training. The basic theme is that model training requires gang semantics: a large cluster of accelerators needs to do coordinated work in order to make progress, and that coordination gets progressively more expensive to enable and harder to do reliably as the number of devices in the cluster increases.

The prior papers explored ways of splitting training into an inner loop, where the model trains fairly traditionally, and an outer optimization loop that aggregates the resulting changes and updates based on them: the outer optimizer works on the deltas in parameter values between sync points. The outer optimizer still runs on the same cluster as all the inner loops, but the devices only need to synchronize with each other at the “outer” sync point. This loosens the coupling between devices and allows introducing failure domains.
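To make the inner/outer split concrete for myself, here is a minimal sketch on a toy quadratic loss. Everything in it is an assumption for illustration: the function names (`inner_steps`, `diloco_round`, `train`) are made up, plain SGD stands in for both the inner and the outer optimizer, and each "replica" is just a list of floats pulled toward its own target.

```python
def inner_steps(params, target, steps, lr):
    """Run `steps` of plain SGD on the toy loss 0.5 * sum((p - t)^2)."""
    p = list(params)
    for _ in range(steps):
        # Gradient of the toy loss with respect to p_i is (p_i - t_i).
        p = [pi - lr * (pi - ti) for pi, ti in zip(p, target)]
    return p


def diloco_round(global_params, replica_targets, inner_step_count, inner_lr, outer_lr):
    """One outer round: independent inner training, then a single outer update."""
    # Each replica starts from the shared parameters and trains without communicating.
    replicas = [
        inner_steps(global_params, target, inner_step_count, inner_lr)
        for target in replica_targets
    ]
    # Outer "gradient": average over replicas of (params at sync start - params after inner loop).
    outer_grad = [
        sum(global_params[i] - replica[i] for replica in replicas) / len(replicas)
        for i in range(len(global_params))
    ]
    # The only point where replicas exchange data is this outer step (plain SGD here).
    return [g - outer_lr * d for g, d in zip(global_params, outer_grad)]


def train(rounds=20):
    params = [0.0, 0.0]
    # Two replicas with different "data"; the average optimum is [2.0, 2.0].
    targets = [[1.0, 3.0], [3.0, 1.0]]
    for _ in range(rounds):
        params = diloco_round(params, targets, inner_step_count=10, inner_lr=0.1, outer_lr=1.0)
    return params
```

Calling `train()` should land very close to `[2.0, 2.0]`, even though the replicas only exchange information once every 10 inner steps.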

This paper addresses the challenge that when you do synchronize, you still have to send data for all the parameters, which requires a lot of bandwidth and can block forward progress. Streaming DiLoCo divides the model layers into different shards and syncs those at different times (in practice, every 5 inner optimizer steps), lowering the peak bandwidth required. They take shards in a strided fashion rather than sequentially to mildly improve stability and performance.
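A rough sketch of the difference between sequential and strided shards, plus a staggered schedule. The helper names are hypothetical, and the round-robin schedule (one shard per sync point) is my own simplification rather than necessarily the exact schedule used in the paper.

```python
def sequential_fragments(num_layers, num_fragments):
    """Contiguous blocks of layers: [0, 1], [2, 3], ..."""
    size = -(-num_layers // num_fragments)  # ceiling division
    return [
        list(range(start, min(start + size, num_layers)))
        for start in range(0, size * num_fragments, size)
    ]


def strided_fragments(num_layers, num_fragments):
    """Interleaved layers: fragment f takes layers f, f + F, f + 2F, ..."""
    return [list(range(f, num_layers, num_fragments)) for f in range(num_fragments)]


def fragment_to_sync(step, num_fragments, sync_interval):
    """Round-robin schedule (an assumption): one fragment every `sync_interval` steps.

    Returns the index of the fragment to sync at this inner step, or None.
    Steps are counted from 1.
    """
    if step % sync_interval != 0:
        return None
    return (step // sync_interval - 1) % num_fragments


print(sequential_fragments(8, 4))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(strided_fragments(8, 4))     # [[0, 4], [1, 5], [2, 6], [3, 7]]
print([fragment_to_sync(s, 4, 5) for s in (5, 10, 15, 20, 25)])  # [0, 1, 2, 3, 0]
```

With 4 shards, each sync point moves roughly a quarter of the parameters instead of all of them, which is where the lower peak bandwidth comes from.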

To further reduce bandwidth, the communication between devices for the outer loop is done in 4-bit floating point! They still do the accumulations/optimization in 32-bit, but they didn’t see any performance loss when using the lower precision for comms. All of these comms are overlapped with the inner loop training, which helps minimize stalls.
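The "compress for the wire, accumulate in higher precision" pattern can be sketched as below. Note the swap: this uses a uniform 4-bit integer quantizer with one shared scale per tensor as a stand-in, not the paper's 4-bit floating point format, and Python floats (64-bit) play the role of the 32-bit accumulator. All function names are hypothetical.

```python
def quantize_4bit(values):
    """Map floats to signed integer codes in [-7, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_4bit(codes, scale):
    """Restore approximate float values from the codes and the shared scale."""
    return [c * scale for c in codes]


def average_outer_gradients(per_replica_grads):
    """Average outer gradients that were each quantized before being 'sent'."""
    total = [0.0] * len(per_replica_grads[0])
    for grad in per_replica_grads:
        codes, scale = quantize_4bit(grad)        # low-precision payload on the wire
        received = dequantize_4bit(codes, scale)  # receiver converts back to floats
        # Accumulation happens in full precision, not in 4 bits.
        total = [t + r for t, r in zip(total, received)]
    return [t / len(per_replica_grads) for t in total]
```

Each value is off by at most half a quantization step after the round trip, and averaging across replicas happens on the restored floats, so rounding error does not compound inside the accumulator.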