Environment:Junyanz Pytorch CycleGAN and pix2pix DDP Multi GPU
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Training |
| Last Updated | 2026-02-09 16:00 GMT |
Overview
Multi-GPU distributed training environment using PyTorch DistributedDataParallel (DDP) with the NCCL backend. It requires the `torchrun` launcher and a DDP-compatible normalization layer (SyncBatchNorm or InstanceNorm).
Description
This environment extends the base Python/PyTorch runtime with support for single-machine multi-GPU training via PyTorch's DistributedDataParallel (DDP). The NCCL backend is used for inter-GPU communication. Training is launched via `torchrun` which sets the required environment variables (`WORLD_SIZE`, `LOCAL_RANK`, `RANK`). Standard batch normalization is not compatible with DDP; users must use `--norm syncbatch` (SyncBatchNorm) or `--norm instance` (InstanceNorm). The codebase handles DDP-aware data loading via `DistributedSampler`, rank-0-only I/O operations (saving checkpoints, logging), and process synchronization barriers.
Usage
Use this environment when training on multiple GPUs on a single machine. Launch training with `torchrun --nproc_per_node=N train.py ...` instead of `python train.py ...`. This is optional; single-GPU and CPU training do not require this environment.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL backend requires Linux |
| Hardware | 2+ NVIDIA GPUs | All GPUs must support CUDA |
| CUDA | 12.1+ | Matching PyTorch CUDA build |
| Network | N/A (single-machine only) | Inter-GPU communication via NVLink/PCIe |
Dependencies
System Packages
- All packages from the base Python_PyTorch_Runtime environment
- NCCL library (bundled with PyTorch CUDA builds)
Python Packages
- `torch` >= 2.4.0 (with `torch.distributed` module)
- `torchrun` CLI (included with PyTorch installation)
Credentials
Environment variables set automatically by `torchrun`:
- `WORLD_SIZE`: Total number of processes
- `LOCAL_RANK`: Process rank within the current node
- `RANK`: Global process rank
These are not user-configured; the `torchrun` launcher injects them into every process it starts.
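As a rough illustration of how a script might consume these variables, the sketch below parses them from a mapping and falls back to single-process defaults when they are absent. `LaunchInfo` and `read_launch_info` are hypothetical names that do not appear in the codebase; only the three variable names come from `torchrun`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class LaunchInfo:
    """Hypothetical container for the values torchrun injects."""

    world_size: int
    rank: int
    local_rank: int

    @property
    def is_distributed(self) -> bool:
        # Same condition as the WORLD_SIZE > 1 check in init_ddp().
        return self.world_size > 1

    @property
    def is_main_process(self) -> bool:
        # Rank 0 is the process that saves checkpoints and writes logs.
        return self.rank == 0


def read_launch_info(env: Optional[Mapping[str, str]] = None) -> LaunchInfo:
    """Read WORLD_SIZE, RANK and LOCAL_RANK, defaulting to one process."""
    env = os.environ if env is None else env
    return LaunchInfo(
        world_size=int(env.get("WORLD_SIZE", "1")),
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
    )
```

Passing an explicit dictionary instead of relying on `os.environ` makes the behaviour easy to check without launching `torchrun`, for example `read_launch_info({"WORLD_SIZE": "4", "RANK": "2", "LOCAL_RANK": "2"})`.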
Quick Install
```bash
# Same base environment as Python_PyTorch_Runtime
conda env create -f environment.yml
conda activate pytorch-img2img

# Launch multi-GPU training (4 GPUs example)
torchrun --nproc_per_node=4 train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --norm syncbatch
```
Code Evidence
DDP initialization from `util/util.py:53-69`:
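```python
# Module-level imports this excerpt relies on
import os

import torch
import torch.distributed as dist


def init_ddp():
    # torchrun sets WORLD_SIZE; a value above 1 means one process per GPU.
    is_ddp = "WORLD_SIZE" in os.environ and int(os.environ["WORLD_SIZE"]) > 1
    if is_ddp:
        if not dist.is_initialized():
            dist.init_process_group(backend="nccl")
        # Bind this process to the GPU that matches its local rank.
        local_rank = int(os.environ["LOCAL_RANK"])
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(local_rank)
    elif torch.cuda.is_available():
        device = torch.device("cuda:0")
        torch.cuda.set_device(0)
    else:
        device = torch.device("cpu")
    return device
```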
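The branching in `init_ddp()` can be traced without any GPU by mirroring it in plain Python. The sketch below is an illustration only: `choose_device_name` is a hypothetical helper that returns the device string instead of a `torch.device`, and it takes CUDA availability as an argument rather than calling `torch.cuda.is_available()`.

```python
from typing import Mapping


def choose_device_name(env: Mapping[str, str], cuda_available: bool) -> str:
    """Return the device string init_ddp() would select for this environment."""
    is_ddp = "WORLD_SIZE" in env and int(env["WORLD_SIZE"]) > 1
    if is_ddp:
        # One process per GPU: LOCAL_RANK selects the GPU on this machine.
        return f"cuda:{int(env['LOCAL_RANK'])}"
    if cuda_available:
        # Single-process run on a machine with a GPU.
        return "cuda:0"
    # No GPU available: fall back to the CPU.
    return "cpu"


# Third process started by `torchrun --nproc_per_node=4`:
print(choose_device_name({"WORLD_SIZE": "4", "LOCAL_RANK": "2"}, True))  # cuda:2
# Plain `python train.py` on a GPU machine:
print(choose_device_name({}, True))  # cuda:0
# Plain `python train.py` on a CPU-only machine:
print(choose_device_name({}, False))  # cpu
```

Note that a launch with `WORLD_SIZE` equal to 1 takes the single-process path, so no process group is created in that case.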
DDP wrapping with barrier synchronization from `models/base_model.py:116-123`:
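```python
if dist.is_initialized():
    # Standard BatchNorm keeps per-GPU statistics, so it is rejected under DDP.
    if self.opt.norm == "batch":
        raise ValueError(...)
    net = torch.nn.parallel.DistributedDataParallel(
        net, device_ids=[self.device.index]
    )
    # Wait until every process has wrapped its network.
    dist.barrier()
```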
DistributedSampler selection from `data/__init__.py:79-86`:
```python
if "LOCAL_RANK" in os.environ:
    self.sampler = DistributedSampler(
        self.dataset, shuffle=not opt.serial_batches
    )
    shuffle = False  # DistributedSampler handles shuffling
else:
    self.sampler = None
    shuffle = not opt.serial_batches
```
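To make the sampler's role concrete, the following sketch approximates how a distributed sampler splits dataset indices across ranks: shuffle with a seed derived from the epoch, pad so the total divides evenly, then give each rank every `world_size`-th index. It uses Python's `random` module rather than PyTorch's generator, so the exact orderings differ from `DistributedSampler`; `shard_indices` is a hypothetical name.

```python
import random
from typing import List


def shard_indices(
    dataset_size: int,
    world_size: int,
    rank: int,
    epoch: int = 0,
    seed: int = 0,
    shuffle: bool = True,
) -> List[int]:
    """Approximate the index subset one rank would draw in a given epoch."""
    indices = list(range(dataset_size))
    if shuffle:
        # Every rank uses the same seed, so all ranks agree on the permutation.
        random.Random(seed + epoch).shuffle(indices)
    # Pad by repeating leading indices so each rank gets the same count.
    per_rank = -(-dataset_size // world_size)  # ceiling division
    total = per_rank * world_size
    padding = total - len(indices)
    if padding:
        repeats = -(-padding // len(indices))
        indices += (indices * repeats)[:padding]
    # Interleaved split: rank r takes positions r, r + world_size, ...
    return indices[rank:total:world_size]


shards = [shard_indices(10, world_size=4, rank=r, epoch=3) for r in range(4)]
for r, shard in enumerate(shards):
    print(f"rank {r}: {shard}")
```

Because the seed depends on the epoch, the permutation only changes between epochs when the epoch number is updated, which is the purpose of calling `set_epoch()` on the real sampler.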
Rank-0-only checkpoint saving from `models/base_model.py:188-189`:
```python
if not dist.is_initialized() or dist.get_rank() == 0:
    for name in self.model_names:
        # ... save logic
```
SyncBatchNorm layer registration from `models/networks.py:29-30`:
```python
elif norm_type == "syncbatch":
    norm_layer = functools.partial(
        nn.SyncBatchNorm, affine=True, track_running_stats=True
    )
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `--norm batch is not compatible with DDP` | Standard BatchNorm does not sync across GPUs | Use `--norm syncbatch` or `--norm instance` |
| Process hangs during training | Processes do not all reach the same barrier or collective call (for example, one rank exits early or skips a step) | Ensure every process reaches the same barrier point |
| `RuntimeError: NCCL error` | NCCL communication failure | Check that all GPUs are visible (`nvidia-smi`, `CUDA_VISIBLE_DEVICES`) and that CUDA is working; set `NCCL_DEBUG=INFO` for more detailed logs |
| Same shuffle order repeated every epoch | `DistributedSampler.set_epoch()` not called at the start of each epoch | The code handles this via `set_epoch()` in train.py |
Compatibility Notes
- Single GPU: DDP is not activated; standard single-GPU training is used automatically.
- CPU: DDP is not supported on CPU in this codebase; `init_ddp()` always initializes the NCCL backend, which requires CUDA GPUs.
- Multi-node: The current codebase only supports single-machine multi-GPU. Multi-node training would require additional configuration.
- Batch normalization: Standard `--norm batch` does not work with DDP because batchnorm statistics are not shared across GPUs. Use `--norm syncbatch` for synchronized batchnorm or `--norm instance` for instance normalization.
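A small numeric example shows why per-GPU statistics matter. The sketch below uses plain Python with made-up activation values for two GPUs. It compares the mean and variance each GPU would compute alone with the values obtained by combining per-GPU sums and counts, which is the general idea behind synchronized batch normalization. It is not a description of how `nn.SyncBatchNorm` is implemented internally.

```python
from typing import List, Sequence, Tuple


def local_stats(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and biased variance of one GPU's mini-batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var


def synced_stats(shards: List[Sequence[float]]) -> Tuple[float, float]:
    """Mean and biased variance over all GPUs, from per-GPU sums and counts."""
    count = sum(len(s) for s in shards)
    total = sum(sum(s) for s in shards)
    total_sq = sum(sum(v * v for v in s) for s in shards)
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, var


# Hypothetical activations for one channel on two GPUs.
gpu0 = [1.0, 2.0, 3.0, 4.0]
gpu1 = [10.0, 20.0, 30.0, 40.0]

print("GPU 0 alone: ", local_stats(gpu0))
print("GPU 1 alone: ", local_stats(gpu1))
print("synchronized:", synced_stats([gpu0, gpu1]))
```

With these values the two GPUs would normalize the same channel with very different means (2.5 and 25), while the synchronized statistics use a single mean of 13.75 for both.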
Related Pages
- Implementation:Junyanz_Pytorch_CycleGAN_and_pix2pix_CycleGANModel_Optimize_Parameters
- Implementation:Junyanz_Pytorch_CycleGAN_and_pix2pix_Pix2PixModel_Optimize_Parameters
- Implementation:Junyanz_Pytorch_CycleGAN_and_pix2pix_Create_Dataset
- Implementation:Junyanz_Pytorch_CycleGAN_and_pix2pix_Define_G_and_D