Principle:Mlfoundations Open flamingo Distributed Training Setup
Overview
Infrastructure pattern for initializing multi-GPU and multi-node training using process group backends with support for multiple distributed launchers.
Description
Distributed training initialization requires each process to know three identifiers: its rank (global process index), local_rank (index among the processes on its node, usually used to select the GPU), and world_size (total number of processes). The process group, typically backed by NCCL for GPU training, must be initialized before any collective operation can run.
OpenFlamingo supports three launch methods:
- torchrun: uses environment variables (RANK, LOCAL_RANK, WORLD_SIZE) set automatically by the launcher.
- SLURM: reads job-scheduler variables (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS) to derive rank information.
- Horovod: delegates rank and world-size discovery to the Horovod runtime via hvd.init().
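The variable lookup for the torchrun and SLURM cases can be sketched as follows. This is a minimal illustration, not OpenFlamingo's actual code: the helper name `read_world_info` and the fallback defaults (rank 0, world size 1 for a single-process run) are assumptions. Horovod is left out because it reports these values through its own runtime rather than through environment variables.

```python
import os
from typing import Mapping, Optional, Sequence, Tuple

# Variable names come from the launch-method list above; the first one found wins.
LOCAL_RANK_VARS = ("LOCAL_RANK", "SLURM_LOCALID")
RANK_VARS = ("RANK", "SLURM_PROCID")
WORLD_SIZE_VARS = ("WORLD_SIZE", "SLURM_NTASKS")


def _first_int(env: Mapping[str, str], names: Sequence[str], default: int) -> int:
    """Return the first variable in `names` that is set, parsed as an int."""
    for name in names:
        if name in env:
            return int(env[name])
    return default


def read_world_info(env: Optional[Mapping[str, str]] = None) -> Tuple[int, int, int]:
    """Hypothetical helper: return (local_rank, rank, world_size)."""
    if env is None:
        env = os.environ
    local_rank = _first_int(env, LOCAL_RANK_VARS, 0)
    rank = _first_int(env, RANK_VARS, 0)
    # Assumed default: a single process when no launcher variables are set.
    world_size = _first_int(env, WORLD_SIZE_VARS, 1)
    return local_rank, rank, world_size
```

Passing the environment as an argument keeps the lookup easy to check without launching a real job, for example `read_world_info({"SLURM_PROCID": "5", "SLURM_LOCALID": "1", "SLURM_NTASKS": "16"})`.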
After initialization, the model is wrapped with one of two distributed wrappers:
- DDP (DistributedDataParallel): for simple data parallelism where the full model fits on a single GPU.
- FSDP (FullyShardedDataParallel): for memory-efficient training of large models by sharding parameters, gradients, and optimizer states across GPUs.
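The choice between the two wrappers might look like the sketch below. The function name `wrap_model` and its `strategy` argument are hypothetical, chosen only for illustration. The sketch assumes the process group is already initialized and that each process drives the CUDA device numbered `local_rank`. A real FSDP setup usually also passes an auto-wrap policy and mixed-precision settings, which are omitted here.

```python
def wrap_model(model, strategy: str, local_rank: int):
    """Hypothetical helper: wrap a torch.nn.Module with DDP or FSDP.

    Must be called after torch.distributed.init_process_group.
    """
    if strategy not in ("ddp", "fsdp"):
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Imported lazily so that argument validation does not require PyTorch.
    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.nn.parallel import DistributedDataParallel as DDP

    if strategy == "ddp":
        # DDP keeps a full replica on this process's GPU.
        device = torch.device("cuda", local_rank)
        return DDP(model.to(device), device_ids=[local_rank])

    # FSDP shards parameters across ranks and moves its shard to `device_id`.
    return FSDP(model, device_id=local_rank)
```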
Usage
Apply this pattern when training on multiple GPUs or multiple nodes. Distributed initialization is required before any distributed operation such as model wrapping, distributed sampling, or gradient synchronization.
Theoretical Basis
Data parallelism replicates the model on each GPU and splits each data batch across the replicas. DDP averages gradients across replicas with all-reduce collectives, which it overlaps with the backward pass, so every replica applies the same optimizer update and parameters stay identical.
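The effect of gradient averaging can be illustrated with a toy simulation in plain Python. This is not how DDP is implemented (real all-reduce runs over NCCL or Gloo on tensors); the function names and the numbers are made up for the example.

```python
from typing import List


def all_reduce_mean(local_grads: List[List[float]]) -> List[float]:
    """Average per-replica gradients element-wise, as an all-reduce would."""
    world_size = len(local_grads)
    return [sum(values) / world_size for values in zip(*local_grads)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied by one replica."""
    return [p - lr * g for p, g in zip(params, grad)]


# Two replicas start from the same parameters but see different data shards,
# so their local gradients differ.
params = [1.0, 2.0]
local_grads = [[0.5, 1.0], [1.5, 3.0]]

avg_grad = all_reduce_mean(local_grads)
# Every replica applies the same averaged gradient, so parameters stay in sync.
replicas = [sgd_step(params, avg_grad, lr=0.5) for _ in local_grads]
```

Because every replica uses `avg_grad` rather than its local gradient, the entries of `replicas` are equal after the step.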
FSDP goes further by sharding model parameters, gradients, and optimizer states across GPUs and gathering full parameters only for the layers currently needed in the forward or backward computation. With full sharding, the persistent model state held by each GPU shrinks roughly in proportion to the number of GPUs, which enables training models whose state would not fit on a single device.
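A back-of-the-envelope estimate shows the scale of the saving. The sketch assumes fp32 parameters and gradients (4 bytes each) plus two fp32 Adam states (8 bytes), so 16 bytes per parameter. It ignores activations, buffers, and the parameters FSDP gathers temporarily during computation, so real numbers will be higher.

```python
# Assumed cost per parameter: fp32 weight + fp32 gradient + two fp32 Adam states.
BYTES_PER_PARAM = 4 + 4 + 8


def model_state_bytes(num_params: int, world_size: int, sharded: bool) -> int:
    """Rough per-GPU bytes for parameters, gradients, and optimizer state."""
    total = num_params * BYTES_PER_PARAM
    # DDP keeps the full state on every GPU; full sharding splits it evenly.
    return total // world_size if sharded else total


ddp_bytes = model_state_bytes(1_000_000_000, world_size=8, sharded=False)
fsdp_bytes = model_state_bytes(1_000_000_000, world_size=8, sharded=True)
```

Under these assumptions a one-billion-parameter model needs about 16 GB of model state per GPU with DDP and about 2 GB per GPU when fully sharded over eight GPUs.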
The init_process_group call establishes the communication backend:
- NCCL: optimized for GPU-to-GPU communication over interconnects such as NVLink and InfiniBand.
- Gloo: CPU-oriented backend, commonly used as a fallback when no GPUs are available or NCCL cannot be used.
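Backend selection and the init_process_group call can be sketched as below. The helper names `pick_backend` and `init_distributed` are hypothetical, and choosing the backend purely from CUDA availability is a simplifying assumption. The `env://` init method expects MASTER_ADDR and MASTER_PORT to be set in the environment, which torchrun does automatically.

```python
def pick_backend(cuda_available: bool) -> str:
    """Choose NCCL when GPUs are usable, otherwise fall back to Gloo."""
    return "nccl" if cuda_available else "gloo"


def init_distributed(rank: int, world_size: int) -> str:
    """Hypothetical helper: initialize the default process group."""
    # Imported lazily so that pick_backend can be used without PyTorch.
    import torch
    import torch.distributed as dist

    backend = pick_backend(torch.cuda.is_available())
    dist.init_process_group(
        backend=backend,
        init_method="env://",  # reads MASTER_ADDR and MASTER_PORT
        rank=rank,
        world_size=world_size,
    )
    return backend
```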
Related Pages
Implementation:Mlfoundations_Open_flamingo_Init_distributed_device