Environment:Huggingface Transformers 3D Parallel Multi GPU
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Infrastructure, GPU |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Multi-GPU environment with NCCL backend for Tensor Parallelism, FSDP Data Parallelism, and Context Parallelism training.
Description
This environment provides the infrastructure for 3D parallel distributed training that combines Tensor Parallelism (TP), Fully Sharded Data Parallelism (FSDP), and Context Parallelism (CP). It requires multiple NVIDIA GPUs connected via NVLink or PCIe, the NCCL communication backend, and PyTorch distributed (torchrun). The environment uses DeviceMesh to organize the GPUs into a three-dimensional (DP, TP, CP) grid.
Usage
Required for the 3D Parallel Distributed Training workflow. Use this when training models too large for a single GPU, or when you need to scale training across multiple GPUs/nodes for throughput.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL requires Linux |
| Hardware | Multiple NVIDIA GPUs | world_size = TP_SIZE x DP_SIZE x CP_SIZE |
| VRAM | >= 16GB per GPU | A100 40GB/80GB recommended |
| Interconnect | NVLink or PCIe | NVLink strongly recommended for TP |
| CUDA | 11.8+ or 12.x | Must support NCCL |
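As the hardware row above notes, the total process count must equal TP_SIZE x DP_SIZE x CP_SIZE. A minimal pre-flight sketch of that check for a single node (the helper name and the use of torch.cuda.device_count() are illustrative assumptions, not part of the example script):
import os
import torch

def check_parallel_dims():
    # Hypothetical check: the parallelism dims must multiply to the process count
    tp = int(os.environ.get("TP_SIZE", "1"))
    dp = int(os.environ.get("DP_SIZE", "1"))
    cp = int(os.environ.get("CP_SIZE", "1"))
    expected = tp * dp * cp
    gpus = torch.cuda.device_count()  # on one node, --nproc_per_node should not exceed this
    if gpus < expected:
        raise RuntimeError(
            f"TP*DP*CP = {expected} but only {gpus} GPUs are visible; "
            "adjust TP_SIZE/DP_SIZE/CP_SIZE or --nproc_per_node"
        )

check_parallel_dims()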
Dependencies
System Packages
- NVIDIA CUDA Toolkit 11.8+ or 12.x
- NCCL 2.x (usually bundled with PyTorch)
- torchrun (from PyTorch distributed)
Python Packages
- torch >= 2.4.0 (with distributed support)
- torch.distributed (NCCL backend)
- torch.distributed.fsdp (FSDP)
- torch.distributed.tensor (Tensor Parallelism, DTensor)
- torch.distributed.checkpoint (DCP)
- transformers >= 5.0
- datasets >= 2.15.0
- wandb (optional, for experiment tracking)
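A quick sanity-check sketch that the installed stack meets these requirements (illustrative only; the expected versions follow the list above):
import torch
import torch.distributed as dist
import transformers

print("torch:", torch.__version__)                # expect >= 2.4.0
print("transformers:", transformers.__version__)  # expect >= 5.0
print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", dist.is_nccl_available())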
Credentials
- WANDB_API_KEY: Weights & Biases API key (optional, for logging).
- HF_TOKEN: HuggingFace API token (for loading gated models).
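Both are normally exported as environment variables before launching. A hedged sketch of picking them up at startup (the explicit login calls are one convenient option, not something the example script requires; huggingface_hub is pulled in as a transformers dependency):
import os
import wandb
from huggingface_hub import login

if os.environ.get("HF_TOKEN"):
    login(token=os.environ["HF_TOKEN"])           # needed only for gated models
if os.environ.get("WANDB_API_KEY"):
    wandb.login(key=os.environ["WANDB_API_KEY"])  # optional experiment tracking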
Quick Install
pip install transformers[torch] datasets wandb
# Launch with torchrun
TP_SIZE=2 DP_SIZE=2 torchrun --nproc_per_node=4 examples/3D_parallel.py
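A hedged multi-node variant of the same launch (host, port, and the 2 x 8 GPU layout are placeholders; run the command on every node):
# Example: 2 nodes x 8 GPUs = 16 processes (TP=8, DP=2, CP=1)
TP_SIZE=8 DP_SIZE=2 CP_SIZE=1 torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 \
  examples/3D_parallel.py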
Code Evidence
NCCL initialization and world size assertion from examples/3D_parallel.py:90-99:
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
dist.init_process_group("nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
assert world_size == tp_size * dp_size * cp_size, (
f"World size ({world_size}) must equal TP size ({tp_size}) * DP size ({dp_size}) * CP size ({cp_size})"
)
Environment variables for parallelism dimensions from examples/3D_parallel.py:75-77:
tp_size = int(os.environ.get("TP_SIZE", "1"))
dp_size = int(os.environ.get("DP_SIZE", "1"))
cp_size = int(os.environ.get("CP_SIZE", "1"))
DeviceMesh construction from examples/3D_parallel.py:101-102:
mesh = torch.arange(world_size).reshape(dp_size, tp_size, cp_size)
world_mesh = DeviceMesh(device_type="cuda", mesh=mesh, mesh_dim_names=("dp", "tp", "cp"))
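Each dimension can then be addressed by name when sharding the model; a short sketch of how sub-meshes are typically sliced from this mesh (the variable names are illustrative, not quoted from the script):
# DeviceMesh supports name-based indexing into sub-meshes
tp_mesh = world_mesh["tp"]   # tensor-parallel sharding of layers
dp_mesh = world_mesh["dp"]   # FSDP / data-parallel gradient sync
cp_mesh = world_mesh["cp"]   # context-parallel sharding of the sequence dimension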
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| World size must equal TP * DP * CP | GPU count mismatch | Ensure --nproc_per_node equals TP_SIZE * DP_SIZE * CP_SIZE |
| NCCL error: unhandled cuda error | GPU communication failure | Check NVLink/PCIe connectivity and CUDA driver version |
| RuntimeError: RANK not set | Not launched with torchrun | Use torchrun --nproc_per_node=N to launch |
| Global batch size not divisible by DP size | Batch/DP mismatch | Set global batch size to a multiple of DP_SIZE |
Compatibility Notes
- Single GPU: Can run with TP=1, DP=1, CP=1 for debugging (use IGNORE_SANITY=1).
- Multi-node: Requires --rdzv_endpoint for rendezvous coordination.
- Context Parallelism: Requires SDPBackend.FLASH_ATTENTION for the SDPA kernel (see the sketch after this list).
- FSDP: Uses ShardingStrategy.NO_SHARD (DDP-like) in the example; full sharding is also available.
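For the Context Parallelism note above, the training step is typically run under the flash-attention SDPA kernel; a minimal sketch, assuming a model and batch are already set up:
from torch.nn.attention import SDPBackend, sdpa_kernel

# Restrict scaled_dot_product_attention to the flash-attention kernel,
# which context parallelism requires.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model(**batch)   # placeholder forward pass
    outputs.loss.backward()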
Related Pages
- Implementation:Huggingface_Transformers_Init_Process_Group
- Implementation:Huggingface_Transformers_DeviceMesh_Construction
- Implementation:Huggingface_Transformers_AutoModelForCausalLM_From_Pretrained_For_TP
- Implementation:Huggingface_Transformers_FSDP_Wrapping
- Implementation:Huggingface_Transformers_Context_Parallel_Training_Loop
- Implementation:Huggingface_Transformers_All_Reduce_Grads
- Implementation:Huggingface_Transformers_DCP_Save