Environment:Mlfoundations Open flamingo PyTorch CUDA Distributed

Knowledge Sources	OpenFlamingo PyTorch Distributed
Domains	Infrastructure, Distributed_Training
Last Updated	2026-02-08 03:30 GMT

Overview

A Linux environment with PyTorch 2.0.1, a CUDA-capable GPU, and the NCCL backend, used for distributed training launched via SLURM or torchrun.

Description

This environment provides the GPU-accelerated distributed training and inference context for OpenFlamingo. It requires PyTorch 2.0.1 with CUDA support and uses the NCCL backend for multi-GPU communication. The distributed initialization supports three launch paths: SLURM (via `srun`), torchrun or `torch.distributed.launch`, and optionally Horovod. Single-GPU execution is also supported: the device is `cuda:0` when CUDA is available and falls back to CPU otherwise.

Usage

Use this environment for all Distributed Training, Few-Shot Evaluation, and Model Inference workflows that require GPU acceleration. It is the mandatory prerequisite for running any multi-GPU training via FSDP or DDP, and for distributed evaluation across multiple GPUs.

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu recommended)	SLURM integration requires Linux; Conda env specifies `openjdk`
Hardware	NVIDIA GPU with CUDA support	Multi-GPU recommended; single GPU supported
VRAM	16GB+ per GPU	Required for 3B+ parameter models with FSDP
Distributed	SLURM or torchrun	SLURM uses `srun --ntasks-per-node=8 --gpus-per-task=1`

Dependencies

System Packages

CUDA toolkit (compatible with PyTorch 2.0.1)
NCCL (default distributed backend)

Python Packages

`torch` == 2.0.1
`torchvision`
`numpy`

Credentials

The following environment variables are used for distributed setup. The rank and world-size variables are set automatically by SLURM or torchrun; the others are set where noted:

`LOCAL_RANK`: Local rank of the process on the node
`RANK`: Global rank of the process
`WORLD_SIZE`: Total number of processes
`SLURM_PROCID`: SLURM process ID (SLURM only)
`SLURM_NTASKS`: Total SLURM tasks (SLURM only)
`SLURM_LOCALID`: SLURM local ID (SLURM only)
`MASTER_ADDR`: Master node address (set in launch script)
`MASTER_PORT`: Master node port (set in launch script)
`WANDB_MODE`: Set to `offline` when `--offline` flag is used
`TRANSFORMERS_OFFLINE`: Set to `1` when `--offline` flag is used
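
When a job hangs at startup, it can help to print which of these variables each process actually sees. The following is a minimal sketch, not part of the OpenFlamingo code base; `DIST_VARS` and `snapshot_dist_env` are hypothetical names introduced here for illustration.

```python
import os
from typing import Mapping, Optional

# Variable names copied from the list above; the tuple itself is illustrative.
DIST_VARS = (
    "LOCAL_RANK", "RANK", "WORLD_SIZE",
    "SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID",
    "MASTER_ADDR", "MASTER_PORT",
    "WANDB_MODE", "TRANSFORMERS_OFFLINE",
)


def snapshot_dist_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return only the distributed-related variables that are set in `env`."""
    env = os.environ if env is None else env
    return {name: env[name] for name in DIST_VARS if name in env}


if __name__ == "__main__":
    # Print one NAME=value line per variable that is set in this process.
    for name, value in snapshot_dist_env().items():
        print(f"{name}={value}")
```

Running this under `srun` or `torchrun` prints one block per process, which makes it easy to spot a rank that is missing `MASTER_ADDR` or has an unexpected `WORLD_SIZE`.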

Quick Install

# Install core PyTorch with CUDA
pip install torch==2.0.1 torchvision

# For SLURM launch (example from run_train.sh)
srun --ntasks-per-node=8 --gpus-per-task=1 python train.py --dist-backend nccl

# For torchrun launch
torchrun --nproc_per_node=8 train.py --dist-backend nccl
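
After installing, a quick sanity check of the PyTorch build may save time before launching a multi-GPU job. The sketch below is an assumption about how one might verify the install, not a script shipped with OpenFlamingo; `describe_torch_install` is a hypothetical helper, and it only uses standard PyTorch queries.

```python
def describe_torch_install() -> dict:
    """Collect version and CUDA/NCCL availability (hypothetical helper)."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this interpreter.
        return {"torch_installed": False}

    import torch.distributed as dist

    return {
        "torch_installed": True,
        "torch_version": torch.__version__,   # expected to start with "2.0.1"
        "cuda_build": torch.version.cuda,     # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
        # is_nccl_available() is only consulted when torch.distributed is built in.
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


if __name__ == "__main__":
    for key, value in describe_torch_install().items():
        print(f"{key}: {value}")
```

If `cuda_available` is False or `nccl_available` is False, multi-GPU training with the NCCL backend will not work in that environment.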

Code Evidence

CUDA device detection and distributed init from `open_flamingo/train/distributed.py:73-132`:

def init_distributed_device(args):
    args.distributed = False
    args.world_size = 1
    args.rank = 0
    args.local_rank = 0
    if args.horovod:
        assert hvd is not None, "Horovod is not installed"
        hvd.init()
        ...
    elif is_using_distributed():
        if "SLURM_PROCID" in os.environ:
            # DDP via SLURM
            args.local_rank, args.rank, args.world_size = world_info_from_env()
            torch.distributed.init_process_group(
                backend=args.dist_backend,
                init_method=args.dist_url,
                world_size=args.world_size,
                rank=args.rank,
            )
        else:
            # DDP via torchrun, torch.distributed.launch
            args.local_rank, _, _ = world_info_from_env()
            torch.distributed.init_process_group(
                backend=args.dist_backend, init_method=args.dist_url
            )
    ...
    if torch.cuda.is_available():
        if args.distributed and not args.no_set_device_rank:
            device = "cuda:%d" % args.local_rank
        else:
            device = "cuda:0"
        torch.cuda.set_device(device)
    else:
        device = "cpu"
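
The device-selection branch at the end of the excerpt above can be restated as a pure function, which makes the three outcomes easy to check without a GPU. This is an illustrative sketch only; `select_device` is a hypothetical name and is not a function in `distributed.py`.

```python
def select_device(cuda_available: bool, distributed: bool,
                  local_rank: int, no_set_device_rank: bool = False) -> str:
    """Mirror the device choice shown in the excerpt, without touching torch."""
    if not cuda_available:
        return "cpu"
    if distributed and not no_set_device_rank:
        # Each process binds to the GPU matching its local rank on the node.
        return f"cuda:{local_rank}"
    # Non-distributed runs, or runs that opt out of rank-based binding.
    return "cuda:0"


if __name__ == "__main__":
    print(select_device(True, True, 3))    # distributed process with local rank 3
    print(select_device(True, False, 0))   # single-GPU run
    print(select_device(False, False, 0))  # no CUDA device available
```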

Environment variable detection for distributed backends from `open_flamingo/train/distributed.py:40-70`:

def is_using_distributed():
    if "WORLD_SIZE" in os.environ:
        return int(os.environ["WORLD_SIZE"]) > 1
    if "SLURM_NTASKS" in os.environ:
        return int(os.environ["SLURM_NTASKS"]) > 1
    return False

def world_info_from_env():
    local_rank = 0
    for v in ("LOCAL_RANK", "MPI_LOCALRANKID", "SLURM_LOCALID",
              "OMPI_COMM_WORLD_LOCAL_RANK"):
        if v in os.environ:
            local_rank = int(os.environ[v])
            break
    ...

SLURM launch script from `open_flamingo/scripts/run_train.sh:1-12`:

#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=15000

Common Errors

Error Message	Cause	Solution
`Horovod is not installed`	`--horovod` flag set without horovod package	Install horovod or use default NCCL backend
NCCL timeout / hang	Firewall blocking inter-node communication	Ensure `MASTER_ADDR` and `MASTER_PORT` are accessible across nodes
`CUDA out of memory`	Model too large for available VRAM	Use `--fsdp` flag, reduce batch size, or enable `--gradient_checkpointing`
Single-GPU fallback to CPU	No CUDA device detected	Verify CUDA toolkit installation and `nvidia-smi` output
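
For the NCCL timeout case, one way to test whether `MASTER_ADDR` and `MASTER_PORT` are reachable from a worker node is a plain TCP connection attempt. The sketch below uses only the Python standard library; `can_reach` is a hypothetical helper, and the default port `15000` is taken from the launch script shown earlier. Note the limitation: the check can only succeed while something (normally rank 0) is listening on that port, so a failure before the job has started is not conclusive.

```python
import os
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    host = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = int(os.environ.get("MASTER_PORT", "15000"))
    status = "reachable" if can_reach(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```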

Compatibility Notes

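Horovod: Optional alternative to the `torch.distributed` initialization path. An MPI-style launch can be recognized from the `OMPI_COMM_WORLD_RANK` / `PMI_RANK` environment variables, but Horovod is only used when the `--horovod` flag is passed explicitly.
SLURM vs torchrun: Both are supported. Distributed mode is enabled when `WORLD_SIZE` or `SLURM_NTASKS` is greater than 1. The SLURM path is then selected if `SLURM_PROCID` is set; otherwise the torchrun path is used, which reads the local rank from `LOCAL_RANK`.
Single GPU: Non-distributed runs still call `torch.distributed.init_process_group`, with `world_size=1, rank=0`.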
CPU-only: Supported but not recommended for training. Device falls back to `"cpu"` when `torch.cuda.is_available()` returns False.
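
The selection order described in these notes can be summarized in a small, torch-free sketch. It follows the branch order of the excerpts in Code Evidence, but `detect_launcher` and its return labels are hypothetical and exist only for illustration; the `is_using_distributed` shown here takes an explicit mapping instead of reading `os.environ` so that it can be exercised in isolation.

```python
from typing import Mapping


def is_using_distributed(env: Mapping[str, str]) -> bool:
    """Same checks as the excerpt, applied to an explicit mapping."""
    if "WORLD_SIZE" in env:
        return int(env["WORLD_SIZE"]) > 1
    if "SLURM_NTASKS" in env:
        return int(env["SLURM_NTASKS"]) > 1
    return False


def detect_launcher(env: Mapping[str, str], horovod_flag: bool = False) -> str:
    """Return an illustrative label for the initialization path that would be taken."""
    if horovod_flag:
        return "horovod"   # only taken when --horovod is passed
    if is_using_distributed(env):
        # SLURM_PROCID distinguishes srun from torchrun / torch.distributed.launch.
        return "slurm" if "SLURM_PROCID" in env else "torchrun"
    return "single"        # world_size=1, rank=0


if __name__ == "__main__":
    print(detect_launcher({"WORLD_SIZE": "8", "LOCAL_RANK": "2"}))
    print(detect_launcher({"SLURM_NTASKS": "8", "SLURM_PROCID": "5"}))
    print(detect_launcher({}))
```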
