Principle:Mlfoundations Open flamingo Distributed Training Setup

From Leeroopedia
Revision as of 17:33, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Mlfoundations_Open_flamingo_Distributed_Training_Setup.md)


Overview

Infrastructure pattern for initializing multi-GPU and multi-node training using process group backends with support for multiple distributed launchers.

Description

Distributed training initialization requires that each process know three identifiers: its rank (global process index), local_rank (index on the current node), and world_size (total number of processes). The process group, typically backed by NCCL on GPUs, must be initialized before any distributed operation can take place.

OpenFlamingo supports three launch methods:

  • torchrun — uses environment variables (RANK, LOCAL_RANK, WORLD_SIZE) set automatically by the launcher.
  • SLURM — reads job-scheduler variables (SLURM_PROCID, SLURM_LOCALID, SLURM_NTASKS) to derive rank information.
  • Horovod — delegates rank and world-size discovery to the Horovod runtime via hvd.init().
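The three discovery paths above can be sketched as follows. This is a minimal illustration, not OpenFlamingo's actual code: the names `DistInfo`, `resolve_dist_info`, and `dist_info_from_horovod` are hypothetical, and checking torchrun variables before SLURM variables is an assumption made here (torchrun can itself be launched inside a SLURM job, in which case both sets are present).

```python
import os
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # global process index
    local_rank: int  # index on the current node
    world_size: int  # total number of processes
    launcher: str    # which launcher's variables were found


def resolve_dist_info(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Hypothetical helper: derive rank info from launcher environment variables."""
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every worker.
    if all(k in env for k in ("RANK", "LOCAL_RANK", "WORLD_SIZE")):
        return DistInfo(int(env["RANK"]), int(env["LOCAL_RANK"]),
                        int(env["WORLD_SIZE"]), "torchrun")
    # Under SLURM, the scheduler exports its own per-task variables.
    if all(k in env for k in ("SLURM_PROCID", "SLURM_LOCALID", "SLURM_NTASKS")):
        return DistInfo(int(env["SLURM_PROCID"]), int(env["SLURM_LOCALID"]),
                        int(env["SLURM_NTASKS"]), "slurm")
    # No launcher detected: fall back to a single, non-distributed process.
    return DistInfo(0, 0, 1, "single")


def dist_info_from_horovod() -> DistInfo:
    """Horovod path: the runtime, not the environment, reports rank and size."""
    import horovod.torch as hvd  # lazy import; requires Horovod to be installed
    hvd.init()
    return DistInfo(hvd.rank(), hvd.local_rank(), hvd.size(), "horovod")
```

Passing the environment in as a mapping keeps the helper easy to exercise without a real launcher, for example `resolve_dist_info({"RANK": "3", "LOCAL_RANK": "1", "WORLD_SIZE": "8"})`.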

After initialization, the model is wrapped with one of two distributed wrappers:

  • DDP (DistributedDataParallel) — for simple data parallelism where the full model fits on a single GPU.
  • FSDP (FullyShardedDataParallel) — for memory-efficient training of large models by sharding parameters, gradients, and optimizer states across GPUs.

Usage

Apply this pattern when training on multiple GPUs or multiple nodes. Distributed initialization is required before any distributed operation such as model wrapping, distributed sampling, or gradient synchronization.

Theoretical Basis

Data parallelism replicates the model on each GPU and splits each batch across GPUs. DDP averages gradients with an all-reduce collective that runs during the backward pass, overlapping communication with computation, so every replica applies the same update and the parameters stay identical across replicas.
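To make the gradient synchronization concrete, here is a small pure-Python simulation (no GPUs or PyTorch involved). It assumes a toy one-parameter model and equal-sized shards, and the helper names are illustrative only. It shows that averaging per-rank gradients reproduces the gradient of the full batch.

```python
from typing import List


def local_gradient(w: float, xs: List[float], ys: List[float]) -> float:
    """Gradient of mean squared error for the model y_hat = w * x on one shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values: List[float]) -> List[float]:
    """Simulated all-reduce: every rank ends up with the mean of all ranks' values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


# Toy data, split evenly across two simulated ranks.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

per_rank = [local_gradient(w, sx, sy) for sx, sy in shards]  # differs per rank
synced = all_reduce_mean(per_rank)                           # identical on all ranks
full_batch = local_gradient(w, xs, ys)                       # single-process reference
```

Because every rank starts from the same parameters and applies the same averaged gradient, the replicas remain in lockstep. The equivalence with the full-batch gradient holds exactly only when shards have equal size, which is the assumption made here.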

FSDP goes further by sharding model parameters, gradients, and optimizer states across GPUs, gathering full parameters only when a layer needs them for forward or backward computation. With N GPUs, each device persistently holds roughly 1/N of this training state, which enables training of models that would not fit on a single device.
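A back-of-the-envelope estimate illustrates the difference. The sketch below assumes 16 bytes of persistent state per parameter (fp32 weights, fp32 gradients, and two fp32 Adam moments); it ignores activations, communication buffers, and the parameters that FSDP temporarily gathers. The function name and the 1-billion-parameter figure are hypothetical and not taken from OpenFlamingo.

```python
def training_state_bytes(num_params: int, world_size: int, sharded: bool,
                         bytes_per_param: int = 16) -> float:
    """Rough per-GPU bytes for parameters, gradients and optimizer state.

    bytes_per_param=16 assumes fp32 weights (4) + fp32 grads (4) + Adam moments (8).
    """
    total = num_params * bytes_per_param
    # DDP keeps a full copy on every GPU; full sharding keeps about 1/world_size.
    return total / world_size if sharded else float(total)


num_params = 1_000_000_000  # hypothetical 1B-parameter model
ddp_gb = training_state_bytes(num_params, world_size=8, sharded=False) / 1e9
fsdp_gb = training_state_bytes(num_params, world_size=8, sharded=True) / 1e9
print(f"DDP:  {ddp_gb:.1f} GB per GPU")
print(f"FSDP: {fsdp_gb:.1f} GB per GPU")
```

Under these assumptions, the replicated state costs 16 GB per GPU while the fully sharded state costs 2 GB per GPU on 8 devices. Real usage will be higher in both cases.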

The init_process_group call establishes the communication backend:

  • NCCL — optimized for GPU-to-GPU communication over NVLink and InfiniBand.
  • Gloo — CPU-oriented backend, typically used as a fallback when CUDA GPUs or NCCL are unavailable.
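A minimal sketch of the backend choice and the init_process_group call is shown below. The helper names `pick_backend` and `init_distributed` are hypothetical, and the `env://` rendezvous assumes that `MASTER_ADDR` and `MASTER_PORT` are already set in the environment, as they are under torchrun. PyTorch is imported lazily, so the selection logic can be read and exercised on its own.

```python
def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Prefer NCCL when GPUs and NCCL are both usable, otherwise fall back to Gloo."""
    return "nccl" if cuda_available and nccl_available else "gloo"


def init_distributed(rank: int, world_size: int) -> str:
    """Initialize the default process group and return the backend that was used."""
    import torch                      # lazy import; requires PyTorch
    import torch.distributed as dist

    backend = pick_backend(torch.cuda.is_available(), dist.is_nccl_available())
    # "env://" reads MASTER_ADDR and MASTER_PORT from the environment.
    dist.init_process_group(backend=backend, init_method="env://",
                            rank=rank, world_size=world_size)
    return backend
```

Each process would call `init_distributed` once, before wrapping the model or creating a distributed sampler, passing the rank and world size discovered from its launcher.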

Related Pages

Implementation:Mlfoundations_Open_flamingo_Init_distributed_device
