
Environment: SpeechBrain Multi-GPU DDP

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Distributed_Training
Last Updated: 2026-02-09 20:00 GMT

Overview

Multi-GPU distributed training environment using PyTorch DDP (DistributedDataParallel) with NCCL, gloo, or MPI backends, launched via `torchrun`.

Description

SpeechBrain supports multi-GPU training through PyTorch's DistributedDataParallel (DDP) framework. The recommended launch method is `torchrun`, which sets the required environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). Three communication backends are supported: NCCL (NVIDIA GPUs, recommended), gloo (CPU or cross-platform), and MPI. The framework automatically wraps modules in DDP, converts BatchNorm to SyncBatchNorm, and handles distributed sampling.
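The wrapping described above can be sketched roughly as follows. This is a minimal illustration of the pattern, not SpeechBrain's actual code; `setup_ddp` is a hypothetical helper name, and the gloo/CPU branch is included so the sketch runs without GPUs:

```python
import datetime
import os

import torch
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_ddp(model: torch.nn.Module, backend: str = "gloo") -> torch.nn.Module:
    """Mirror the DDP wrapping pattern: read torchrun's environment
    variables, init the process group, convert BatchNorm layers to
    SyncBatchNorm, and wrap the module in DDP."""
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.distributed.init_process_group(
        backend=backend,
        rank=rank,
        timeout=datetime.timedelta(seconds=7200),
    )
    if backend == "nccl":
        # Each process owns one GPU, selected by its local rank.
        torch.cuda.set_device(local_rank)
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(
            model.to(f"cuda:{local_rank}")
        )
        return DDP(model, device_ids=[local_rank])
    # gloo / CPU: no device_ids for the DDP wrapper.
    return DDP(model, device_ids=None)
```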

Usage

Required when training on multiple GPUs within a single node or across multiple nodes. Not needed for single-GPU or CPU-only training. All SpeechBrain training implementations that use `Brain.fit()` automatically support DDP when launched with `torchrun`.

System Requirements

Category | Requirement | Notes
Hardware | Multiple NVIDIA GPUs (for NCCL) | gloo works with CPU; MPI requires an MPI installation
Network | High-bandwidth interconnect for multi-node | InfiniBand or NVLink recommended
Software | NCCL library (for NCCL backend) | Bundled with CUDA PyTorch builds

Dependencies

System Packages

  • `nccl` (bundled with PyTorch CUDA builds)
  • `openmpi` (only if using MPI backend)

Python Packages

  • `torch` >= 1.9 with distributed support
  • All core SpeechBrain dependencies

Credentials

The following environment variables must be set; `torchrun` sets them automatically:

  • `RANK`: Global rank of the process
  • `LOCAL_RANK`: Local rank within the node (maps to CUDA device)
  • `WORLD_SIZE`: Total number of processes
  • `MASTER_ADDR`: Address of the master node (multi-node only)
  • `MASTER_PORT`: Port of the master node (multi-node only)

Optional:

  • `CUDA_VISIBLE_DEVICES`: Restrict visible GPUs
  • `SLURM_PROCID`: Used as rank fallback in Slurm environments
  • `JSM_NAMESPACE_RANK`: Used as rank fallback in JSM environments
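The rank-fallback chain and the LOCAL_RANK-to-device mapping can be sketched as below. `resolve_rank_and_device` is an illustrative helper name, not a SpeechBrain API; it assumes the fallback order described above:

```python
import os

import torch


def resolve_rank_and_device():
    """Resolve the global rank (RANK, falling back to Slurm/JSM
    variables) and map LOCAL_RANK to a CUDA device, with a CPU
    fallback when no GPU is visible."""
    rank = (
        os.environ.get("RANK")
        or os.environ.get("SLURM_PROCID")
        or os.environ.get("JSM_NAMESPACE_RANK")
    )
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    if torch.cuda.is_available():
        if local_rank >= torch.cuda.device_count():
            raise ValueError("Not enough GPUs available!")
        device = torch.device(f"cuda:{local_rank}")
    else:
        device = torch.device("cpu")
    return (int(rank) if rank is not None else None), device
```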

Quick Install

# No additional installation needed; DDP is built into PyTorch

# Single-node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py hparams/train.yaml

# Multi-node, 2 nodes x 2 GPUs each
# (run on node 0; on node 1, repeat with --node_rank=1)
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=node0_ip --master_port=5555 \
    train.py hparams/train.yaml

Code Evidence

DDP initialization from `speechbrain/utils/distributed.py:276-320`:

rank = os.environ.get("RANK")
local_rank = os.environ.get("LOCAL_RANK")
if local_rank is None or rank is None:
    return

if run_opts["distributed_backend"] == "nccl":
    if not torch.distributed.is_nccl_available():
        raise ValueError("NCCL is not supported in your machine.")

torch.distributed.init_process_group(
    backend=run_opts["distributed_backend"],
    rank=int(rank),  # env vars are strings; init_process_group expects an int
    timeout=datetime.timedelta(seconds=7200),
)

DP/DDP mutual exclusivity from `speechbrain/core.py:667-674`:

if self.data_parallel_backend and self.distributed_launch:
    raise ValueError(
        "To use data_parallel backend, start your script with:\n\t"
        "python experiment.py hyperparams.yaml --data_parallel_backend=True\n"
        "To use DDP backend, start your script with:\n\t"
        "torchrun [args] experiment.py hyperparams.yaml"
    )

SyncBatchNorm auto-conversion from `speechbrain/core.py:1672`:

module = SyncBatchNorm.convert_sync_batchnorm(module)
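For illustration, the conversion can be observed on a toy module. It rewrites BatchNorm layers even before any process group exists; only the forward pass during training needs an initialized group:

```python
import torch
from torch import nn

# convert_sync_batchnorm walks the module tree and swaps every
# BatchNorm layer for a SyncBatchNorm with the same parameters.
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
converted = nn.SyncBatchNorm.convert_sync_batchnorm(model)
print(type(converted[1]).__name__)  # SyncBatchNorm
```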

Common Errors

Error Message | Cause | Solution
`NCCL is not supported in your machine` | PyTorch built without NCCL | Reinstall PyTorch with CUDA support
`Not enough GPUs available!` | LOCAL_RANK exceeds GPU count | Reduce `--nproc_per_node` to match GPU count
`Cannot use data_parallel and DDP simultaneously` | Both DP and DDP flags set | Use only one: either `--data_parallel_backend` or `torchrun`
`Cannot automatically solve distributed sampling for IterableDataset` | Using IterableDataset with DDP | Implement manual sharding in the IterableDataset
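Manual sharding for an IterableDataset can be sketched as below. `ShardedRange` is an illustrative class, not part of SpeechBrain; it assumes stride-based splitting across both DDP ranks and DataLoader workers:

```python
import os

import torch
from torch.utils.data import IterableDataset


class ShardedRange(IterableDataset):
    """Toy IterableDataset that strides over indices so that each
    (rank, worker) pair sees a disjoint subset of the data."""

    def __init__(self, n: int):
        self.n = n

    def __iter__(self):
        rank = int(os.environ.get("RANK", "0"))
        world_size = int(os.environ.get("WORLD_SIZE", "1"))
        # Shard further across DataLoader workers, if any.
        info = torch.utils.data.get_worker_info()
        workers = info.num_workers if info else 1
        worker_id = info.id if info else 0
        shard = rank * workers + worker_id
        num_shards = world_size * workers
        for i in range(shard, self.n, num_shards):
            yield i
```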

Compatibility Notes

  • NCCL backend: Requires NVIDIA GPUs. Default and recommended for GPU training.
  • gloo backend: Works with CPU. Sets `device_ids=None` for DDP wrapper.
  • MPI backend: Requires OpenMPI installation. Less commonly used.
  • DataParallel (legacy): Deprecated alternative. Cannot be combined with DDP.
  • DDP timeout: Set to 7200 seconds (2 hours) to accommodate long-running operations.
  • Gradient accumulation: Uses `no_sync` context to skip gradient synchronization during accumulation steps.
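The `no_sync` pattern from the last note can be sketched as follows. The single-process gloo group exists only to make the example runnable on CPU; the accumulation interval and model are arbitrary:

```python
import datetime
import os

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group just so DDP can be constructed on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29512")
torch.distributed.init_process_group(
    "gloo", rank=0, world_size=1,
    timeout=datetime.timedelta(seconds=7200),
)

model = DDP(torch.nn.Linear(4, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4

for step in range(8):
    batch = torch.randn(2, 4)
    loss = model(batch).sum() / accum_steps
    if (step + 1) % accum_steps != 0:
        # Accumulation step: skip the gradient all-reduce.
        with model.no_sync():
            loss.backward()
    else:
        # Boundary step: synchronize gradients, then update.
        loss.backward()
        opt.step()
        opt.zero_grad()

torch.distributed.destroy_process_group()
```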
