Implementation: Hugging Face Transformers Init Process Group
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Training |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool, provided by PyTorch, for initializing the NCCL distributed process group; used in the Hugging Face Transformers 3D parallel training example.
Description
This wrapper calls torch.distributed.init_process_group("nccl") to establish the distributed communication backend for multi-GPU training. After initialization, each process retrieves its rank (global process identifier), world_size (total number of processes), and local_rank (per-node GPU index). The local rank is used to pin each process to a specific CUDA device via torch.cuda.set_device(local_rank).
The code also validates that the total world size equals the product of the three parallelism dimensions (TP x DP x CP), ensuring that the mesh topology is consistent with the number of available GPUs.
Usage
Use this wrapper at the very beginning of a distributed training script, after verifying that the environment variables RANK and WORLD_SIZE are set (indicating the script was launched via torchrun or an equivalent distributed launcher). This must be called before constructing a DeviceMesh, loading a model with tensor parallelism, or performing any collective operations.
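For example, a single-node launch with torchrun might look like the following. The flag value is illustrative: --nproc_per_node must match the number of GPUs on the node, and the TP_SIZE/DP_SIZE/CP_SIZE values are assumptions whose product must equal the world size.

```shell
# Illustrative single-node launch: torchrun sets RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR, and MASTER_PORT in the environment of each spawned process.
TP_SIZE=2 DP_SIZE=2 CP_SIZE=2 torchrun --nproc_per_node=8 examples/3D_parallel.py
```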
Code Reference
Source Location
- Repository: transformers
- File: examples/3D_parallel.py
- Line: 91
Signature
torch.distributed.init_process_group(backend="nccl")
Import
import torch.distributed as dist
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| backend | str | Yes | Communication backend to use. Set to "nccl" for GPU training. |
Outputs
| Name | Type | Description |
|---|---|---|
| (side effect) | None | Initializes the default process group. After this call, dist.get_rank(), dist.get_world_size(), and dist.is_initialized() become available. |
Environment Variables Required
| Name | Type | Description |
|---|---|---|
| RANK | int | Global rank of this process, set by torchrun. |
| WORLD_SIZE | int | Total number of processes, set by torchrun. |
| LOCAL_RANK | int | Local rank on this node, used for CUDA device assignment. |
| MASTER_ADDR | str | Address of the rank-0 node for rendezvous. |
| MASTER_PORT | str | Port on the rank-0 node for rendezvous. |
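Before calling init_process_group, a script can sanity-check that the torchrun rendezvous variables above are all present. This is a hypothetical helper for illustration, not part of the example script:

```python
import os

# The five environment variables torchrun sets for every spawned process.
REQUIRED_VARS = ("RANK", "WORLD_SIZE", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")

def launched_with_torchrun(env=None):
    """Return True when all torchrun rendezvous variables are set."""
    env = os.environ if env is None else env
    return all(name in env for name in REQUIRED_VARS)
```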
Usage Examples
Basic Usage
import os
import torch
import torch.distributed as dist
# Initialize distributed environment (called by each process)
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
With Parallelism Validation
import os
import torch.distributed as dist

tp_size = int(os.environ.get("TP_SIZE", "1"))
dp_size = int(os.environ.get("DP_SIZE", "1"))
cp_size = int(os.environ.get("CP_SIZE", "1"))
dist.init_process_group("nccl")
world_size = dist.get_world_size()
assert world_size == tp_size * dp_size * cp_size, (
f"World size ({world_size}) must equal TP ({tp_size}) * DP ({dp_size}) * CP ({cp_size})"
)
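The same API can be exercised on a machine without GPUs by swapping in the gloo backend. This CPU-only, single-process sketch is a sanity check of the accessors listed in the I/O contract, not the training configuration:

```python
import os
import torch.distributed as dist

# Single-process rendezvous on localhost; no torchrun launcher is needed here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# gloo runs on CPU, so this works on any machine with PyTorch installed.
dist.init_process_group("gloo", rank=0, world_size=1)
assert dist.is_initialized()
assert dist.get_rank() == 0
assert dist.get_world_size() == 1
dist.destroy_process_group()
```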