Environment:Junyanz Pytorch CycleGAN and pix2pix DDP Multi GPU

Knowledge Sources	pytorch-CycleGAN-and-pix2pix PyTorch DDP
Domains	Infrastructure, Distributed_Training
Last Updated	2026-02-09 16:00 GMT

Overview

Multi-GPU distributed training environment using PyTorch DistributedDataParallel (DDP) with the NCCL backend. It requires the `torchrun` launcher and a DDP-compatible normalization layer (SyncBatchNorm or InstanceNorm).

Description

This environment extends the base Python/PyTorch runtime with single-machine multi-GPU training via PyTorch's DistributedDataParallel (DDP), using the NCCL backend for inter-GPU communication. Training is launched via `torchrun`, which sets the required environment variables (`WORLD_SIZE`, `LOCAL_RANK`, `RANK`). Standard batch normalization computes its statistics separately on each GPU, so the codebase rejects `--norm batch` under DDP; users must pass `--norm syncbatch` (SyncBatchNorm) or `--norm instance` (InstanceNorm). The codebase handles DDP-aware data loading via `DistributedSampler`, restricts I/O operations (saving checkpoints, logging) to rank 0, and synchronizes processes with barriers.

Usage

Use this environment when training on multiple GPUs on a single machine. Launch training with `torchrun --nproc_per_node=N train.py ...` instead of `python train.py ...`. This is optional; single-GPU and CPU training do not require this environment.

System Requirements

Category	Requirement	Notes
OS	Linux	NCCL backend requires Linux
Hardware	2+ NVIDIA GPUs	All GPUs must support CUDA
CUDA	12.1+	Matching PyTorch CUDA build
Network	N/A (single-machine only)	Inter-GPU communication via NVLink/PCIe

Dependencies

System Packages

All packages from the base Python_PyTorch_Runtime environment
NCCL library (bundled with PyTorch CUDA builds)

Python Packages

`torch` >= 2.4.0 (with `torch.distributed` module)
`torchrun` CLI (included with PyTorch installation)

Credentials

Environment variables set automatically by torchrun:

`WORLD_SIZE`: Total number of processes (equal to `--nproc_per_node` on a single machine)
`LOCAL_RANK`: Process index within the current node; selects the CUDA device
`RANK`: Global process index; rank 0 performs checkpoint saving and logging

These are not user-configured; they are injected by the `torchrun` launcher.
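
As a rough illustration of how a script might consume these variables, the sketch below parses them from a mapping and falls back to single-process defaults when `torchrun` did not set them. The `DistEnv` dataclass and the `read_dist_env` helper are hypothetical names introduced here, not part of the repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    world_size: int
    rank: int
    local_rank: int

    @property
    def is_ddp(self) -> bool:
        # Same test as the repository's init_ddp(): more than one process.
        return self.world_size > 1


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Parse torchrun's variables; the defaults describe a single process."""
    env = os.environ if env is None else env
    return DistEnv(
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check without launching `torchrun`.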

Quick Install

# Same base environment as Python_PyTorch_Runtime
conda env create -f environment.yml
conda activate pytorch-img2img

# Launch multi-GPU training (4 GPUs example)
torchrun --nproc_per_node=4 train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --norm syncbatch

Code Evidence

DDP initialization from `util/util.py:53-69`:

def init_ddp():
    is_ddp = "WORLD_SIZE" in os.environ and int(os.environ["WORLD_SIZE"]) > 1
    if is_ddp:
        if not dist.is_initialized():
            dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(local_rank)
    elif torch.cuda.is_available():
        device = torch.device("cuda:0")
        torch.cuda.set_device(0)
    else:
        device = torch.device("cpu")
    return device
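
The branch order in `init_ddp` can be exercised without GPUs by factoring the decision into a pure function. The sketch below is a restatement for illustration, not repository code: `pick_device_name` is a hypothetical helper that returns the device string the real function would hand to `torch.device`.

```python
from typing import Mapping


def pick_device_name(env: Mapping[str, str], cuda_available: bool) -> str:
    """Return the device string chosen by the same three-way branch."""
    is_ddp = "WORLD_SIZE" in env and int(env["WORLD_SIZE"]) > 1
    if is_ddp:
        # One process per GPU: LOCAL_RANK selects the device.
        return f"cuda:{int(env['LOCAL_RANK'])}"
    if cuda_available:
        # Single-process run on a CUDA machine always uses the first GPU.
        return "cuda:0"
    return "cpu"
```

Note that a launch with `WORLD_SIZE=1` takes the single-GPU branch, since the DDP test requires more than one process.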

DDP wrapping with barrier synchronization from `models/base_model.py:116-123`:

if dist.is_initialized():
    # Plain BatchNorm is rejected under DDP; syncbatch and instance are allowed.
    if self.opt.norm == "batch":
        raise ValueError(...)
    net = torch.nn.parallel.DistributedDataParallel(
        net, device_ids=[self.device.index]
    )
    dist.barrier()

DistributedSampler selection from `data/__init__.py:79-86`:

if "LOCAL_RANK" in os.environ:
    self.sampler = DistributedSampler(
        self.dataset, shuffle=not opt.serial_batches
    )
    shuffle = False  # DistributedSampler handles shuffling
else:
    self.sampler = None
    shuffle = not opt.serial_batches
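
To make the sampler's role concrete, the sketch below imitates how a distributed sampler splits a dataset: shuffle with a seed derived from the epoch, pad so the length divides evenly, then give each rank every `world_size`-th index. It is a simplified standard-library approximation of the idea, not PyTorch's implementation, and `shard_indices` is a hypothetical name.

```python
import random
from typing import List


def shard_indices(n: int, world_size: int, rank: int,
                  epoch: int = 0, seed: int = 0,
                  shuffle: bool = True) -> List[int]:
    """Indices one rank would process in one epoch (illustrative only)."""
    indices = list(range(n))
    if shuffle:
        # Every rank uses the same seed, so all ranks agree on the order.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so each rank gets the same count.
    per_rank = -(-n // world_size)  # ceiling division
    total = per_rank * world_size
    while len(indices) < total:
        indices += indices[: total - len(indices)]
    # Interleaved split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]
```

Because the seed depends on `epoch`, an epoch counter that never advances reproduces the same order every time, which is why the real sampler relies on `set_epoch()` being called once per epoch.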

Rank-0-only checkpoint saving from `models/base_model.py:188-189`:

if not dist.is_initialized() or dist.get_rank() == 0:
    for name in self.model_names:
        # ... save logic
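
The guard pattern above can be tried without a process group. The following sketch assumes the rank is passed in explicitly (the repository queries `dist.get_rank()` instead) and writes a small text file only on rank 0; `save_on_main_rank` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional


def save_on_main_rank(path: Path, text: str, rank: Optional[int]) -> bool:
    """Write `text` to `path` only for rank 0 or a non-distributed run.

    `rank=None` stands in for "no process group initialized".
    Returns True when this process performed the write.
    """
    if rank is None or rank == 0:
        path.write_text(text)
        return True
    # Other ranks skip the write so they do not race on the same file.
    return False
```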

SyncBatchNorm layer registration from `models/networks.py:29-30`:

elif norm_type == "syncbatch":
    norm_layer = functools.partial(
        nn.SyncBatchNorm, affine=True, track_running_stats=True
    )
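
A small numeric example may help show why synchronization matters. The sketch below uses plain Python and made-up activation values (not taken from the repository) to compare per-GPU batch means with the mean over the combined batch, which is what synchronized batch normalization computes across processes.

```python
from statistics import fmean, pvariance

# Hypothetical activations for one channel on two GPUs.
gpu0 = [1.0, 2.0, 3.0, 4.0]
gpu1 = [10.0, 20.0, 30.0, 40.0]

# Unsynchronized BatchNorm: each replica normalizes with its own statistics.
local_means = [fmean(gpu0), fmean(gpu1)]

# Synchronized: statistics are computed over the combined batch.
combined = gpu0 + gpu1
global_mean = fmean(combined)
global_var = pvariance(combined)  # population (biased) variance of the batch
```

Here the two replicas would normalize with means of 2.5 and 25.0, while the synchronized mean is 13.75, so the same input value would be normalized differently depending on which GPU it landed on.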

Common Errors

Error Message	Cause	Solution
`--norm batch is not compatible with DDP`	Standard BatchNorm does not sync across GPUs	Use `--norm syncbatch` or `--norm instance`
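Process hangs during training	One or more ranks never reach a collective call (e.g. `dist.barrier()`) that the other ranks are waiting in	Ensure every process executes the same sequence of barriers and collectives; keep them outside rank-0-only branches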
`RuntimeError: NCCL error`	NCCL communication failure	Check that all GPUs are visible (`nvidia-smi`, `CUDA_VISIBLE_DEVICES`) and that CUDA is working; set `NCCL_DEBUG=INFO` for more detail
Same shuffle order every epoch	`DistributedSampler.set_epoch()` not called, so the shuffle seed never changes	The code handles this via `set_epoch()` in train.py

Compatibility Notes

Single GPU: DDP is not activated; standard single-GPU training is used automatically.
CPU: Not supported in DDP mode; `init_ddp` hard-codes the NCCL backend, which requires CUDA GPUs. Without `torchrun`, the code falls back to ordinary CPU training.
Multi-node: The current codebase only supports single-machine multi-GPU. Multi-node training would require additional configuration.
Batch normalization: Standard `--norm batch` does not work with DDP because batchnorm statistics are not shared across GPUs. Use `--norm syncbatch` for synchronized batchnorm or `--norm instance` for instance normalization.
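
The normalization rule can be expressed as an early check on the parsed options. The sketch below is an assumption about how such a check might look; `check_norm_for_ddp` is a hypothetical helper and its message wording is illustrative, not the repository's.

```python
def check_norm_for_ddp(norm: str, ddp_active: bool) -> None:
    """Raise if standard batch norm is requested for a DDP run."""
    if ddp_active and norm == "batch":
        raise ValueError(
            "--norm batch keeps separate statistics on each GPU; "
            "use --norm syncbatch or --norm instance for DDP"
        )
```

Running a check like this before any model is built surfaces the misconfiguration on every rank at once, rather than after the process group has started.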
