Environment:Mlfoundations Open flamingo PyTorch CUDA Distributed
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-08 03:30 GMT |
Overview
Linux environment with PyTorch 2.0.1, CUDA-capable GPU, and NCCL backend for distributed training via SLURM or torchrun.
Description
This environment provides the GPU-accelerated distributed training and inference context for OpenFlamingo. It requires PyTorch 2.0.1 with CUDA support and uses the NCCL backend for multi-GPU communication. Distributed initialization supports three launch paths: SLURM (via `srun`), torchrun or `torch.distributed.launch`, and optionally Horovod. Single-process execution is also supported: the device is `cuda:0` when CUDA is available and `cpu` otherwise.
Usage
Use this environment for all Distributed Training, Few-Shot Evaluation, and Model Inference workflows that require GPU acceleration. It is a prerequisite for multi-GPU training via FSDP or DDP and for distributed evaluation across multiple GPUs.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | SLURM integration requires Linux; Conda env specifies `openjdk` |
| Hardware | NVIDIA GPU with CUDA support | Multi-GPU recommended; single GPU supported |
| VRAM | 16GB+ per GPU | Required for 3B+ parameter models with FSDP |
| Distributed | SLURM or torchrun | SLURM uses `srun --ntasks-per-node=8 --gpus-per-task=1` |
Dependencies
System Packages
- CUDA toolkit (compatible with PyTorch 2.0.1)
- NCCL (default distributed backend)
Python Packages
- `torch` == 2.0.1
- `torchvision`
- `numpy`
Credentials
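The following environment variables are used during setup. The rank and world-size variables are set automatically by SLURM or torchrun; the remaining ones are set by the launch script or by the `--offline` flag, as noted: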
- `LOCAL_RANK`: Local rank of the process on the node
- `RANK`: Global rank of the process
- `WORLD_SIZE`: Total number of processes
- `SLURM_PROCID`: SLURM process ID (SLURM only)
- `SLURM_NTASKS`: Total SLURM tasks (SLURM only)
- `SLURM_LOCALID`: SLURM local ID (SLURM only)
- `MASTER_ADDR`: Master node address (set in launch script)
- `MASTER_PORT`: Master node port (set in launch script)
- `WANDB_MODE`: Set to `offline` when `--offline` flag is used
- `TRANSFORMERS_OFFLINE`: Set to `1` when `--offline` flag is used
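When a launch misbehaves, it can help to print which of these variables each process actually sees. The following is a minimal sketch: `snapshot_dist_env` and `DIST_ENV_VARS` are hypothetical names introduced here for illustration and are not part of OpenFlamingo. The helper only reads the variables listed above and never modifies the environment.

```python
import os

# Variables taken from the list above. The tuple and helper names are
# illustrative assumptions, not OpenFlamingo identifiers.
DIST_ENV_VARS = (
    "LOCAL_RANK", "RANK", "WORLD_SIZE",
    "SLURM_PROCID", "SLURM_NTASKS", "SLURM_LOCALID",
    "MASTER_ADDR", "MASTER_PORT",
    "WANDB_MODE", "TRANSFORMERS_OFFLINE",
)


def snapshot_dist_env(env=None):
    """Return {name: value or None} for each listed variable."""
    env = os.environ if env is None else env
    return {name: env.get(name) for name in DIST_ENV_VARS}


if __name__ == "__main__":
    # Print one line per variable so output from different ranks can be compared.
    for name, value in snapshot_dist_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Running this under `srun` or `torchrun` in place of the training script shows, per process, whether the launcher populated the expected variables.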
Quick Install
# Install core PyTorch with CUDA
pip install torch==2.0.1 torchvision
# For SLURM launch (example from run_train.sh)
srun --ntasks-per-node=8 --gpus-per-task=1 python train.py --dist-backend nccl
# For torchrun launch
torchrun --nproc_per_node=8 train.py --dist-backend nccl
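After installing, a quick sanity check can confirm that the pinned version is in place and that CUDA and NCCL are visible to PyTorch. This is a sketch under the assumption that comparing the base version string against `2.0.1` is sufficient; `check_torch_install` is a hypothetical helper, not part of the repository.

```python
import importlib.util

PINNED_TORCH = "2.0.1"  # version pinned on this page


def check_torch_install(pinned=PINNED_TORCH):
    """Report whether torch is importable, matches the pin, and sees CUDA/NCCL."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}

    import torch
    import torch.distributed as dist

    # Drop local build tags such as "+cu118" before comparing.
    version = torch.__version__.split("+")[0]
    return {
        "installed": True,
        "version": version,
        "matches_pin": version == pinned,
        "cuda_available": torch.cuda.is_available(),
        # is_nccl_available() is only consulted when distributed support is built in.
        "nccl_available": dist.is_available() and dist.is_nccl_available(),
    }


if __name__ == "__main__":
    print(check_torch_install())
```

If `cuda_available` is false on a GPU node, the installed wheel may be a CPU-only build or the driver may not be visible to the process.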
Code Evidence
CUDA device detection and distributed init from `open_flamingo/train/distributed.py:73-132`:
def init_distributed_device(args):
    args.distributed = False
    args.world_size = 1
    args.rank = 0
    args.local_rank = 0
    if args.horovod:
        assert hvd is not None, "Horovod is not installed"
        hvd.init()
        ...
    elif is_using_distributed():
        if "SLURM_PROCID" in os.environ:
            # DDP via SLURM
            args.local_rank, args.rank, args.world_size = world_info_from_env()
            torch.distributed.init_process_group(
                backend=args.dist_backend,
                init_method=args.dist_url,
                world_size=args.world_size,
                rank=args.rank,
            )
        else:
            # DDP via torchrun, torch.distributed.launch
            args.local_rank, _, _ = world_info_from_env()
            torch.distributed.init_process_group(
                backend=args.dist_backend, init_method=args.dist_url
            )
    ...
    if torch.cuda.is_available():
        if args.distributed and not args.no_set_device_rank:
            device = "cuda:%d" % args.local_rank
        else:
            device = "cuda:0"
        torch.cuda.set_device(device)
    else:
        device = "cpu"
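The device-selection rule at the end of the excerpt can be isolated as a pure function, which makes it easy to test without a GPU. This is a sketch mirroring the excerpt's branching; `pick_device` is a hypothetical name, and its boolean parameters stand in for `torch.cuda.is_available()`, `args.distributed`, and `args.no_set_device_rank`.

```python
def pick_device(cuda_available, distributed, local_rank, no_set_device_rank=False):
    """Return the device string chosen by the rule shown in the excerpt.

    Parameters stand in for torch.cuda.is_available(), args.distributed,
    args.local_rank and args.no_set_device_rank respectively.
    """
    if not cuda_available:
        return "cpu"
    if distributed and not no_set_device_rank:
        # One GPU per process, indexed by the process's local rank.
        return "cuda:%d" % local_rank
    return "cuda:0"


if __name__ == "__main__":
    print(pick_device(True, True, 3))        # distributed, per-rank device
    print(pick_device(True, True, 3, True))  # rank-based device disabled
    print(pick_device(False, True, 3))       # no CUDA available
```

Note that with `no_set_device_rank` set, every process selects `cuda:0`; in that mode the launcher is presumably expected to restrict each process to a single visible GPU.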
Environment variable detection for distributed backends from `open_flamingo/train/distributed.py:40-70`:
def is_using_distributed():
    if "WORLD_SIZE" in os.environ:
        return int(os.environ["WORLD_SIZE"]) > 1
    if "SLURM_NTASKS" in os.environ:
        return int(os.environ["SLURM_NTASKS"]) > 1
    return False

def world_info_from_env():
    local_rank = 0
    for v in ("LOCAL_RANK", "MPI_LOCALRANKID", "SLURM_LOCALID",
              "OMPI_COMM_WORLD_LOCAL_RANK"):
        if v in os.environ:
            local_rank = int(os.environ[v])
            break
    ...
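Two precedence rules follow from the excerpt: `WORLD_SIZE` is consulted before `SLURM_NTASKS`, and the first local-rank variable found wins. The sketch below restates only the logic shown above, with the environment passed in as a mapping so the rules can be exercised without a real launcher. The `env` parameter and the `local_rank_from_env` name are assumptions for illustration, and the elided remainder of `world_info_from_env` is deliberately not reproduced.

```python
import os

LOCAL_RANK_VARS = (
    "LOCAL_RANK",
    "MPI_LOCALRANKID",
    "SLURM_LOCALID",
    "OMPI_COMM_WORLD_LOCAL_RANK",
)


def is_using_distributed(env=None):
    """Same checks as the excerpt, but against an injectable mapping."""
    env = os.environ if env is None else env
    if "WORLD_SIZE" in env:
        return int(env["WORLD_SIZE"]) > 1
    if "SLURM_NTASKS" in env:
        return int(env["SLURM_NTASKS"]) > 1
    return False


def local_rank_from_env(env=None):
    """Return the first local-rank variable that is set, defaulting to 0."""
    env = os.environ if env is None else env
    for name in LOCAL_RANK_VARS:
        if name in env:
            return int(env[name])
    return 0


if __name__ == "__main__":
    # WORLD_SIZE takes precedence, so this run is treated as non-distributed.
    print(is_using_distributed({"WORLD_SIZE": "1", "SLURM_NTASKS": "8"}))
    # LOCAL_RANK is checked before SLURM_LOCALID.
    print(local_rank_from_env({"SLURM_LOCALID": "5", "LOCAL_RANK": "2"}))
```

One consequence is that a stale `WORLD_SIZE=1` exported in the shell would mask a multi-task SLURM allocation, which is worth ruling out when a job unexpectedly runs as a single process.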
SLURM launch script from `open_flamingo/scripts/run_train.sh:1-12`:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=15000
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `Horovod is not installed` | `--horovod` flag set without horovod package | Install horovod or use default NCCL backend |
| NCCL timeout / hang | Firewall blocking inter-node communication | Ensure `MASTER_ADDR` and `MASTER_PORT` are accessible across nodes |
| `CUDA out of memory` | Model too large for available VRAM | Use `--fsdp` flag, reduce batch size, or enable `--gradient_checkpointing` |
| Single-GPU fallback to CPU | No CUDA device detected | Verify CUDA toolkit installation and `nvidia-smi` output |
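For the NCCL timeout case, a plain TCP probe from a worker node can help separate network or firewall problems from NCCL-level ones. This is a rough sketch using only the standard library; `can_reach_master` is a hypothetical helper, and it only succeeds while something is listening on the port, for example after rank 0 has started the rendezvous.

```python
import os
import socket


def can_reach_master(addr, port, timeout=3.0):
    """Return True if a TCP connection to (addr, port) succeeds within timeout."""
    try:
        with socket.create_connection((addr, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    addr = os.environ.get("MASTER_ADDR")
    port = os.environ.get("MASTER_PORT")
    if addr is None or port is None:
        print("MASTER_ADDR / MASTER_PORT are not set in this process")
    else:
        print(f"{addr}:{port} reachable: {can_reach_master(addr, port)}")
```

A failed probe while rank 0 is known to be running points toward routing or firewall configuration; a successful probe suggests looking at NCCL settings or rank mismatches instead.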
Compatibility Notes
- Horovod: Optional alternative to the `torch.distributed` initialization path, used only when the `--horovod` flag is passed explicitly. Horovod launches are detected via the `OMPI_COMM_WORLD_RANK` / `PMI_RANK` environment variables.
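- SLURM vs torchrun: Both are supported. A run is treated as distributed when `WORLD_SIZE` (checked first) or `SLURM_NTASKS` is greater than 1. The SLURM path is taken when `SLURM_PROCID` is set; otherwise the torchrun path is used, with the local rank read from `LOCAL_RANK`.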
- Single GPU: `torch.distributed.init_process_group` is still called, with `world_size=1, rank=0`, so a process group exists even in single-GPU runs.
- CPU-only: Supported but not recommended for training. Device falls back to `"cpu"` when `torch.cuda.is_available()` returns False.
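The branch order in `init_distributed_device` determines which initialization path a process takes. The sketch below summarizes that order as a pure function; `select_init_path` and its return labels are hypothetical, and the `distributed` argument stands in for the result of `is_using_distributed()`.

```python
def select_init_path(horovod_flag, distributed, env):
    """Name the initialization branch a process would take.

    horovod_flag: stands in for args.horovod (the --horovod flag).
    distributed:  stands in for the result of is_using_distributed().
    env:          mapping of environment variables, e.g. os.environ.
    """
    if horovod_flag:
        return "horovod"
    if distributed:
        # SLURM_PROCID is checked before falling through to the torchrun branch.
        return "slurm" if "SLURM_PROCID" in env else "torchrun"
    return "single_process"


if __name__ == "__main__":
    print(select_init_path(False, True, {"SLURM_PROCID": "0"}))
    print(select_init_path(False, True, {"LOCAL_RANK": "0"}))
    print(select_init_path(False, False, {}))
```

Because `SLURM_PROCID` is tested first, a process that has both SLURM and torchrun variables set would follow the SLURM branch under this reading of the excerpt.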
Related Pages
- Implementation:Mlfoundations_Open_flamingo_Init_distributed_device
- Implementation:Mlfoundations_Open_flamingo_Flamingo_wrap_fsdp
- Implementation:Mlfoundations_Open_flamingo_Train_one_epoch
- Implementation:Mlfoundations_Open_flamingo_Save_checkpoint
- Implementation:Mlfoundations_Open_flamingo_All_gather_json_dump
- Implementation:Mlfoundations_Open_flamingo_AdamW_cosine_schedule