Environment: SpeechBrain Multi-GPU DDP
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Multi-GPU distributed training environment using PyTorch DDP (DistributedDataParallel) with NCCL, gloo, or MPI backends, launched via `torchrun`.
Description
SpeechBrain supports multi-GPU training through PyTorch's DistributedDataParallel (DDP) framework. The recommended launch method is `torchrun`, which sets the required environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). Three communication backends are supported: NCCL (NVIDIA GPUs, recommended), gloo (CPU or cross-platform), and MPI. The framework automatically wraps modules in DDP, converts BatchNorm to SyncBatchNorm, and handles distributed sampling.
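The automatic wrapping described above can be sketched as follows. This is an illustrative helper, not SpeechBrain's actual API: the function name `wrap_for_ddp` is hypothetical, and it only mirrors the SyncBatchNorm conversion and DDP wrap that `Brain.fit()` performs per process.

```python
# Illustrative sketch (hypothetical helper, not the SpeechBrain API):
# convert BatchNorm to SyncBatchNorm, then wrap the module in DDP.
import os

import torch
from torch.nn import SyncBatchNorm
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_ddp(module, backend="nccl"):
    """Mirror what Brain.fit() does under torchrun: convert BatchNorm
    layers to SyncBatchNorm, then wrap the module in DDP."""
    # Note: SyncBatchNorm layers only execute on GPU modules.
    module = SyncBatchNorm.convert_sync_batchnorm(module)
    if backend == "nccl":
        # NCCL path: pin the module to the GPU matching LOCAL_RANK.
        local_rank = int(os.environ["LOCAL_RANK"])
        module = module.to(f"cuda:{local_rank}")
        return DDP(module, device_ids=[local_rank])
    # gloo/CPU path: DDP is created without device_ids.
    return DDP(module)
```

On the gloo path, `device_ids` is left unset, matching the compatibility note below that gloo uses `device_ids=None` for the DDP wrapper.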
Usage
Required when training on multiple GPUs within a single node or across multiple nodes. Not needed for single-GPU or CPU-only training. All SpeechBrain training implementations that use `Brain.fit()` automatically support DDP when launched with `torchrun`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | Multiple NVIDIA GPUs (for NCCL) | gloo works with CPU; MPI requires MPI installation |
| Network | High-bandwidth interconnect for multi-node | InfiniBand or NVLink recommended |
| Software | NCCL library (for NCCL backend) | Bundled with CUDA PyTorch builds |
Dependencies
System Packages
- `nccl` (bundled with PyTorch CUDA builds)
- `openmpi` (only if using MPI backend)
Python Packages
- `torch` >= 1.9 with distributed support
- All core SpeechBrain dependencies
Environment Variables
The following environment variables must be set; `torchrun` sets them automatically:
- `RANK`: Global rank of the process
- `LOCAL_RANK`: Local rank within the node (maps to CUDA device)
- `WORLD_SIZE`: Total number of processes
- `MASTER_ADDR`: Address of the master node (multi-node only)
- `MASTER_PORT`: Port of the master node (multi-node only)
Optional:
- `CUDA_VISIBLE_DEVICES`: Restrict visible GPUs
- `SLURM_PROCID`: Used as rank fallback in Slurm environments
- `JSM_NAMESPACE_RANK`: Used as rank fallback in JSM environments
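The fallback order for the rank variables above can be sketched with a small stdlib-only function. `resolve_rank` is a hypothetical name for illustration; it is not SpeechBrain's actual implementation.

```python
# Illustrative sketch of the rank fallback order (hypothetical helper,
# not SpeechBrain's actual code): torchrun's RANK first, then the
# Slurm and JSM scheduler variables.
import os
from typing import Optional


def resolve_rank() -> Optional[int]:
    """Return the global rank from the environment, or None when no
    rank variable is set (i.e., not a distributed launch)."""
    for var in ("RANK", "SLURM_PROCID", "JSM_NAMESPACE_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return None
```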
Quick Install
```bash
# No additional installation needed; DDP is built into PyTorch

# Single-node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py hparams/train.yaml

# Multi-node, 2 nodes x 2 GPUs each
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=node0_ip --master_port=5555 \
    train.py hparams/train.yaml
```
Code Evidence
DDP initialization from `speechbrain/utils/distributed.py:276-320`:
```python
rank = os.environ.get("RANK")
local_rank = os.environ.get("LOCAL_RANK")
if local_rank is None or rank is None:
    return
if run_opts["distributed_backend"] == "nccl":
    if not torch.distributed.is_nccl_available():
        raise ValueError("NCCL is not supported in your machine.")
torch.distributed.init_process_group(
    backend=run_opts["distributed_backend"],
    rank=rank,
    timeout=datetime.timedelta(seconds=7200),
)
```
DP/DDP mutual exclusivity from `speechbrain/core.py:667-674`:
```python
if self.data_parallel_backend and self.distributed_launch:
    raise ValueError(
        "To use data_parallel backend, start your script with:\n\t"
        "python experiment.py hyperparams.yaml --data_parallel_backend=True\n"
        "To use DDP backend, start your script with:\n\t"
        "torchrun [args] experiment.py hyperparams.yaml"
    )
```
SyncBatchNorm auto-conversion from `speechbrain/core.py:1672`:
```python
module = SyncBatchNorm.convert_sync_batchnorm(module)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NCCL is not supported in your machine` | PyTorch built without NCCL | Reinstall PyTorch with CUDA support |
| `Not enough GPUs available!` | LOCAL_RANK exceeds GPU count | Reduce `--nproc_per_node` to match GPU count |
| `Cannot use data_parallel and DDP simultaneously` | Both DP and DDP flags set | Use only one: either `--data_parallel_backend` or `torchrun` |
| `Cannot automatically solve distributed sampling for IterableDataset` | Using IterableDataset with DDP | Implement manual sharding in the IterableDataset |
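For the last error in the table, manual sharding means making each DDP process iterate over a disjoint slice of the data. A minimal round-robin sketch, assuming sharding by global rank (the class `ShardedLines` is illustrative; it also ignores `DataLoader` worker splitting, which a real implementation must handle too):

```python
# Illustrative manual sharding for an IterableDataset under DDP
# (hypothetical class, not SpeechBrain code): each process keeps only
# the items whose index matches its global rank modulo WORLD_SIZE.
import os

from torch.utils.data import IterableDataset


class ShardedLines(IterableDataset):
    """IterableDataset that yields a disjoint round-robin shard of the
    items on each DDP process."""

    def __init__(self, items):
        self.items = list(items)
        # Fall back to a single-process view when DDP env vars are absent.
        self.rank = int(os.environ.get("RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))

    def __iter__(self):
        # Process r yields items r, r + W, r + 2W, ...
        for i, item in enumerate(self.items):
            if i % self.world_size == self.rank:
                yield item
```

With `WORLD_SIZE=2`, rank 0 sees the even-indexed items and rank 1 the odd-indexed ones, so no example is processed twice per epoch.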
Compatibility Notes
- NCCL backend: Requires NVIDIA GPUs. Default and recommended for GPU training.
- gloo backend: Works with CPU. Sets `device_ids=None` for DDP wrapper.
- MPI backend: Requires OpenMPI installation. Less commonly used.
- DataParallel (legacy): Deprecated alternative. Cannot be combined with DDP.
- DDP timeout: Set to 7200 seconds (2 hours) to accommodate long-running operations.
- Gradient accumulation: Uses `no_sync` context to skip gradient synchronization during accumulation steps.
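The `no_sync` pattern from the last note can be sketched as below. This is an illustrative accumulation loop, not SpeechBrain's actual implementation; `accumulate_step` and its parameters are hypothetical names.

```python
# Illustrative gradient-accumulation loop (hypothetical helper, not
# SpeechBrain code): skip the gradient all-reduce with no_sync() on all
# but the last micro-batch of each accumulation window.
import contextlib

import torch


def accumulate_step(ddp_model, batches, optimizer, accumulation=4):
    """Run backward on each micro-batch, synchronizing gradients across
    processes only on the final one before the optimizer step."""
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches):
        sync_now = (i + 1) % accumulation == 0
        # no_sync() suppresses DDP's all-reduce for this backward pass.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
            # Scale so the accumulated gradient matches one large batch.
            (loss / accumulation).backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```

Skipping the all-reduce on the intermediate micro-batches avoids paying communication cost once per micro-batch when only the accumulated gradient is needed.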