Implementation: Hugging Face Transformers Init Process Group
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Training |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool, provided by PyTorch, for initializing the NCCL distributed process group; used in the Hugging Face Transformers 3D parallel training example.
Description
This wrapper calls torch.distributed.init_process_group("nccl") to establish the distributed communication backend for multi-GPU training. After initialization, each process retrieves its rank (global process identifier), world_size (total number of processes), and local_rank (per-node GPU index). The local rank is used to pin each process to a specific CUDA device via torch.cuda.set_device(local_rank).
The code also validates that the total world size equals the product of the three parallelism dimensions (TP x DP x CP), ensuring that the mesh topology is consistent with the number of available GPUs.
Usage
Use this wrapper at the very beginning of a distributed training script, after verifying that the environment variables RANK and WORLD_SIZE are set (indicating the script was launched via torchrun or an equivalent distributed launcher). This must be called before constructing a DeviceMesh, loading a model with tensor parallelism, or performing any collective operations.
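For example, a single-node launch with torchrun might look like the following. The flag value is illustrative: --nproc_per_node must match the number of GPUs on the node, and the TP_SIZE/DP_SIZE/CP_SIZE values are assumptions whose product must equal the world size.

```shell
# Illustrative single-node launch: torchrun sets RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR, and MASTER_PORT in the environment of each spawned process.
TP_SIZE=2 DP_SIZE=2 CP_SIZE=2 torchrun --nproc_per_node=8 examples/3D_parallel.py
```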
Code Reference
Source Location
- Repository: transformers
- File: examples/3D_parallel.py
- Line: 91
Signature
torch.distributed.init_process_group(backend="nccl")
Import
import torch.distributed as dist
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| backend | str | Yes | Communication backend to use. Set to "nccl" for GPU training. |
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Initializes the default process group. After this call, dist.get_rank(), dist.get_world_size(), and dist.is_initialized() become available. |
Environment Variables Required
| Name | Type | Description |
|---|---|---|
| RANK | int | Global rank of this process, set by torchrun. |
| WORLD_SIZE | int | Total number of processes, set by torchrun. |
| LOCAL_RANK | int | Local rank on this node, used for CUDA device assignment. |
| MASTER_ADDR | str | Address of the rank-0 node for rendezvous. |
| MASTER_PORT | str | Port on the rank-0 node for rendezvous. |
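Before calling init_process_group, a script can sanity-check that the torchrun rendezvous variables above are all present. This is a hypothetical helper for illustration, not part of the example script:

```python
import os

# The five environment variables torchrun sets for every spawned process.
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")

def launched_with_torchrun(env=None):
    """Return True when all torchrun rendezvous variables are set."""
    env = os.environ if env is None else env
    return all(name in env for name in REQUIRED_VARS)
```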
Usage Examples
Basic Usage
import os
import torch
import torch.distributed as dist
# Initialize distributed environment (called by each process)
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
With Parallelism Validation
import os
import torch.distributed as dist

tp_size = int(os.environ.get("TP_SIZE", "1"))
dp_size = int(os.environ.get("DP_SIZE", "1"))
cp_size = int(os.environ.get("CP_SIZE", "1"))
dist.init_process_group("nccl")
world_size = dist.get_world_size()
assert world_size == tp_size * dp_size * cp_size, (
f"World size ({world_size}) must equal TP ({tp_size}) * DP ({dp_size}) * CP ({cp_size})"
)
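The same API can be exercised on a machine without GPUs by swapping in the gloo backend. This CPU-only, single-process sketch is a sanity check of the accessors listed in the I/O contract, not the training configuration:

```python
import os
import torch.distributed as dist

# Single-process rendezvous on localhost; no torchrun launcher is needed here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo runs on CPU, so this works on any machine with PyTorch installed.
dist.init_process_group("gloo", rank=0, world_size=1)
assert dist.is_initialized()
assert dist.get_rank() == 0
assert dist.get_world_size() == 1
dist.destroy_process_group()
```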