Environment: SpeechBrain Multi-GPU DDP
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Multi-GPU distributed training environment using PyTorch DDP (DistributedDataParallel) with NCCL, gloo, or MPI backends, launched via `torchrun`.
Description
SpeechBrain supports multi-GPU training through PyTorch's DistributedDataParallel (DDP) framework. The recommended launch method is `torchrun`, which sets the required environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). Three communication backends are supported: NCCL (NVIDIA GPUs, recommended), gloo (CPU or cross-platform), and MPI. The framework automatically wraps modules in DDP, converts BatchNorm to SyncBatchNorm, and handles distributed sampling.
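The automatic wrapping described above can be sketched as follows. This is an illustrative helper, not SpeechBrain's actual API: the function name `wrap_for_ddp` is hypothetical, and it only mirrors the SyncBatchNorm conversion and DDP wrap that `Brain.fit()` performs per process.

```python
# Illustrative sketch (hypothetical helper, not the SpeechBrain API):
# convert BatchNorm to SyncBatchNorm, then wrap the module in DDP.
import os

import torch
from torch.nn import SyncBatchNorm
from torch.nn.parallel import DistributedDataParallel as DDP


def wrap_for_ddp(module, backend="nccl"):
    """Mirror what Brain.fit() does under torchrun: convert BatchNorm
    layers to SyncBatchNorm, then wrap the module in DDP."""
    # Note: SyncBatchNorm layers only execute on GPU modules.
    module = SyncBatchNorm.convert_sync_batchnorm(module)
    if backend == "nccl":
        # NCCL path: pin the module to the GPU matching LOCAL_RANK.
        local_rank = int(os.environ["LOCAL_RANK"])
        module = module.to(f"cuda:{local_rank}")
        return DDP(module, device_ids=[local_rank])
    # gloo/CPU path: DDP is created without device_ids.
    return DDP(module)
```

On the gloo path, `device_ids` is left unset, matching the compatibility note below that gloo uses `device_ids=None` for the DDP wrapper.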
Usage
Required when training on multiple GPUs within a single node or across multiple nodes. Not needed for single-GPU or CPU-only training. All SpeechBrain training implementations that use `Brain.fit()` automatically support DDP when launched with `torchrun`.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware | Multiple NVIDIA GPUs (for NCCL) | gloo works with CPU; MPI requires MPI installation |
| Network | High-bandwidth interconnect for multi-node | InfiniBand or NVLink recommended |
| Software | NCCL library (for NCCL backend) | Bundled with CUDA PyTorch builds |
Dependencies
System Packages
- `nccl` (bundled with PyTorch CUDA builds)
- `openmpi` (only if using MPI backend)
Python Packages
- `torch` >= 1.9 with distributed support
- All core SpeechBrain dependencies
Environment Variables
The following environment variables must be set; `torchrun` sets them automatically:
- `RANK`: Global rank of the process
- `LOCAL_RANK`: Local rank within the node (maps to CUDA device)
- `WORLD_SIZE`: Total number of processes
- `MASTER_ADDR`: Address of the master node (multi-node only)
- `MASTER_PORT`: Port of the master node (multi-node only)
Optional:
- `CUDA_VISIBLE_DEVICES`: Restrict visible GPUs
- `SLURM_PROCID`: Used as rank fallback in Slurm environments
- `JSM_NAMESPACE_RANK`: Used as rank fallback in JSM environments
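The fallback order for the rank variables above can be sketched with a small stdlib-only function. `resolve_rank` is a hypothetical name for illustration; it is not SpeechBrain's actual implementation.

```python
# Illustrative sketch of the rank fallback order (hypothetical helper,
# not SpeechBrain's actual code): torchrun's RANK first, then the
# Slurm and JSM scheduler variables.
import os
from typing import Optional


def resolve_rank() -> Optional[int]:
    """Return the global rank from the environment, or None when no
    rank variable is set (i.e., not a distributed launch)."""
    for var in ("RANK", "SLURM_PROCID", "JSM_NAMESPACE_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return None
```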
Quick Install
```bash
# No additional installation needed; DDP is built into PyTorch

# Single-node, 4 GPUs
torchrun --standalone --nproc_per_node=4 train.py hparams/train.yaml

# Multi-node, 2 nodes x 2 GPUs each
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=0 \
    --master_addr=node0_ip --master_port=5555 \
    train.py hparams/train.yaml
```
Code Evidence
DDP initialization from `speechbrain/utils/distributed.py:276-320`:
```python
rank = os.environ.get("RANK")
local_rank = os.environ.get("LOCAL_RANK")
if local_rank is None or rank is None:
    return
if run_opts["distributed_backend"] == "nccl":
    if not torch.distributed.is_nccl_available():
        raise ValueError("NCCL is not supported in your machine.")
torch.distributed.init_process_group(
    backend=run_opts["distributed_backend"],
    rank=rank,
    timeout=datetime.timedelta(seconds=7200),
)
```
DP/DDP mutual exclusivity from `speechbrain/core.py:667-674`:
```python
if self.data_parallel_backend and self.distributed_launch:
    raise ValueError(
        "To use data_parallel backend, start your script with:\n\t"
        "python experiment.py hyperparams.yaml --data_parallel_backend=True\n"
        "To use DDP backend, start your script with:\n\t"
        "torchrun [args] experiment.py hyperparams.yaml"
    )
```
SyncBatchNorm auto-conversion from `speechbrain/core.py:1672`:
```python
module = SyncBatchNorm.convert_sync_batchnorm(module)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `NCCL is not supported in your machine` | PyTorch built without NCCL | Reinstall PyTorch with CUDA support |
| `Not enough GPUs available!` | LOCAL_RANK exceeds GPU count | Reduce `--nproc_per_node` to match GPU count |
| `Cannot use data_parallel and DDP simultaneously` | Both DP and DDP flags set | Use only one: either `--data_parallel_backend` or `torchrun` |
| `Cannot automatically solve distributed sampling for IterableDataset` | Using IterableDataset with DDP | Implement manual sharding in the IterableDataset |
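For the last error in the table, manual sharding means making each DDP process iterate over a disjoint slice of the data. A minimal round-robin sketch, assuming sharding by global rank (the class `ShardedLines` is illustrative; it also ignores `DataLoader` worker splitting, which a real implementation must handle too):

```python
# Illustrative manual sharding for an IterableDataset under DDP
# (hypothetical class, not SpeechBrain code): each process keeps only
# the items whose index matches its global rank modulo WORLD_SIZE.
import os

from torch.utils.data import IterableDataset


class ShardedLines(IterableDataset):
    """IterableDataset that yields a disjoint round-robin shard of the
    items on each DDP process."""

    def __init__(self, items):
        self.items = list(items)
        # Fall back to a single-process view when DDP env vars are absent.
        self.rank = int(os.environ.get("RANK", 0))
        self.world_size = int(os.environ.get("WORLD_SIZE", 1))

    def __iter__(self):
        # Process r yields items r, r + W, r + 2W, ...
        for i, item in enumerate(self.items):
            if i % self.world_size == self.rank:
                yield item
```

With `WORLD_SIZE=2`, rank 0 sees the even-indexed items and rank 1 the odd-indexed ones, so no example is processed twice per epoch.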
Compatibility Notes
- NCCL backend: Requires NVIDIA GPUs. Default and recommended for GPU training.
- gloo backend: Works with CPU. Sets `device_ids=None` for DDP wrapper.
- MPI backend: Requires OpenMPI installation. Less commonly used.
- DataParallel (legacy): Deprecated alternative. Cannot be combined with DDP.
- DDP timeout: Set to 7200 seconds (2 hours) to accommodate long-running operations.
- Gradient accumulation: Uses `no_sync` context to skip gradient synchronization during accumulation steps.
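The `no_sync` pattern from the last note can be sketched as below. This is an illustrative accumulation loop, not SpeechBrain's actual implementation; `accumulate_step` and its parameters are hypothetical names.

```python
# Illustrative gradient-accumulation loop (hypothetical helper, not
# SpeechBrain code): skip the gradient all-reduce with no_sync() on all
# but the last micro-batch of each accumulation window.
import contextlib

import torch


def accumulate_step(ddp_model, batches, optimizer, accumulation=4):
    """Run backward on each micro-batch, synchronizing gradients across
    processes only on the final one before the optimizer step."""
    optimizer.zero_grad()
    for i, (inputs, targets) in enumerate(batches):
        sync_now = (i + 1) % accumulation == 0
        # no_sync() suppresses DDP's all-reduce for this backward pass.
        ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
        with ctx:
            loss = torch.nn.functional.mse_loss(ddp_model(inputs), targets)
            # Scale so the accumulated gradient matches one large batch.
            (loss / accumulation).backward()
        if sync_now:
            optimizer.step()
            optimizer.zero_grad()
```

Skipping the all-reduce on the intermediate micro-batches avoids paying communication cost once per micro-batch when only the accumulated gradient is needed.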