
Implementation:Huggingface Transformers Init Process Group

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, Training
Last Updated 2026-02-13 00:00 GMT

Overview

A concrete wrapper for initializing the NCCL distributed process group, provided by PyTorch and used in the Hugging Face Transformers 3D parallel training example.

Description

This wrapper calls torch.distributed.init_process_group("nccl") to establish the distributed communication backend for multi-GPU training. After initialization, each process retrieves its rank (global process identifier), world_size (total number of processes), and local_rank (per-node GPU index). The local rank is used to pin each process to a specific CUDA device via torch.cuda.set_device(local_rank).

The code also validates that the total world size equals the product of the three parallelism dimensions (TP x DP x CP), ensuring that the mesh topology is consistent with the number of available GPUs.

Usage

Use this wrapper at the very beginning of a distributed training script, after verifying that the environment variables RANK and WORLD_SIZE are set (indicating the script was launched via torchrun or an equivalent distributed launcher). This must be called before constructing a DeviceMesh, loading a model with tensor parallelism, or performing any collective operations.
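The ordering above can be sketched end to end. This is a minimal single-process illustration, not the Transformers example itself: the gloo backend and the defaulted environment variables are stand-ins so the sketch runs on a CPU-only machine, whereas a real torchrun launch would supply the variables and use "nccl".

```python
import os
import torch
import torch.distributed as dist

# Stand-ins for the variables torchrun would set (illustrative values).
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

# 1. Initialize the default process group first ("nccl" in real GPU training).
dist.init_process_group("gloo")

# 2. Pin this process to its GPU via the local rank (skipped without CUDA).
local_rank = int(os.environ["LOCAL_RANK"])
if torch.cuda.is_available():
    torch.cuda.set_device(local_rank)

# 3. Only now is it safe to build a DeviceMesh, load a tensor-parallel
#    model, or issue collective operations.
print(dist.get_rank(), dist.get_world_size())  # prints "0 1"

dist.destroy_process_group()
```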

Code Reference

Source Location

  • Repository: transformers
  • File: examples/3D_parallel.py
  • Line: 91

Signature

torch.distributed.init_process_group(backend="nccl")

Import

import torch.distributed as dist

I/O Contract

Inputs

  • backend (str, required): Communication backend to use. Set to "nccl" for GPU training.

Outputs

  • Returns None. As a side effect, initializes the default process group; after this call, dist.get_rank(), dist.get_world_size(), and dist.is_initialized() become available.

Environment Variables Required

  • RANK (int): Global rank of this process, set by torchrun.
  • WORLD_SIZE (int): Total number of processes, set by torchrun.
  • LOCAL_RANK (int): Local rank on this node, used for CUDA device assignment.
  • MASTER_ADDR (str): Address of the rank-0 node for rendezvous.
  • MASTER_PORT (str): Port on the rank-0 node for rendezvous.
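torchrun sets all five of these variables automatically for each process it spawns. A hypothetical launch of the example script on a single 8-GPU node might look like the following; the TP_SIZE/DP_SIZE/CP_SIZE values and the GPU count are illustrative, not prescribed by the example:

```shell
# One node, 8 processes; torchrun injects RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR, and MASTER_PORT into each process's environment.
TP_SIZE=2 DP_SIZE=2 CP_SIZE=2 torchrun --nproc_per_node=8 examples/3D_parallel.py
```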

Usage Examples

Basic Usage

import os
import torch
import torch.distributed as dist

# Initialize distributed environment (called by each process)
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

With Parallelism Validation

tp_size = int(os.environ.get("TP_SIZE", "1"))
dp_size = int(os.environ.get("DP_SIZE", "1"))
cp_size = int(os.environ.get("CP_SIZE", "1"))

dist.init_process_group("nccl")
world_size = dist.get_world_size()

assert world_size == tp_size * dp_size * cp_size, (
    f"World size ({world_size}) must equal TP ({tp_size}) * DP ({dp_size}) * CP ({cp_size})"
)
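The examples above never tear the group down. In a complete script it is good practice to destroy the process group on exit so backend resources are released cleanly. A minimal sketch, again using the gloo backend and defaulted variables so it runs without torchrun:

```python
import os
import torch.distributed as dist

# Illustrative defaults in place of the variables torchrun would set.
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29502")

dist.init_process_group("gloo")
try:
    pass  # training loop would run here
finally:
    # Tears down the default process group and frees backend resources.
    dist.destroy_process_group()
```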

Related Pages

Implements Principle

Requires Environment
