
Environment:Huggingface Transformers 3D Parallel Multi GPU

From Leeroopedia
Knowledge Sources
Domains Distributed_Training, Infrastructure, GPU
Last Updated 2026-02-13 20:00 GMT

Overview

Multi-GPU environment using the NCCL backend for combined Tensor Parallelism (TP), Fully Sharded Data Parallel (FSDP), and Context Parallelism (CP) training.

Description

This environment provides the infrastructure for 3D parallel distributed training combining Tensor Parallelism (TP), Fully Sharded Data Parallelism (FSDP), and Context Parallelism (CP). It requires multiple NVIDIA GPUs connected via NVLink or PCIe, the NCCL communication backend, and PyTorch distributed (torchrun). The environment uses DeviceMesh to organize GPUs into a multi-dimensional grid of (DP, TP, CP) dimensions.

Usage

Required for the 3D Parallel Distributed Training workflow. Use this when training models too large for a single GPU, or when you need to scale training across multiple GPUs/nodes for throughput.

System Requirements

Category     | Requirement           | Notes
OS           | Linux                 | NCCL requires Linux
Hardware     | Multiple NVIDIA GPUs  | world_size = TP_SIZE x DP_SIZE x CP_SIZE
VRAM         | >= 16GB per GPU       | A100 40GB/80GB recommended
Interconnect | NVLink or PCIe        | NVLink strongly recommended for TP
CUDA         | 11.8+ or 12.x         | Must support NCCL
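The GPU-count relationship in the Hardware row can be verified before launching; a minimal pure-Python sketch (the helper names are illustrative, not from the example script):

```python
def required_world_size(tp_size: int, dp_size: int, cp_size: int) -> int:
    """GPU processes needed for a (DP, TP, CP) mesh: the dimensions multiply."""
    return tp_size * dp_size * cp_size

def check_launch(nproc_per_node: int, tp_size: int, dp_size: int, cp_size: int) -> None:
    """Fail early if --nproc_per_node does not match TP_SIZE * DP_SIZE * CP_SIZE."""
    need = required_world_size(tp_size, dp_size, cp_size)
    if nproc_per_node != need:
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} but TP*DP*CP = {need}"
        )

check_launch(nproc_per_node=4, tp_size=2, dp_size=2, cp_size=1)  # OK: 2*2*1 == 4
```

Running this check on the launcher side gives a clearer error than waiting for the assertion inside the training script.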

Dependencies

System Packages

  • NVIDIA CUDA Toolkit 11.8+ or 12.x
  • NCCL 2.x (usually bundled with PyTorch)
  • torchrun (from PyTorch distributed)

Python Packages

  • torch >= 2.4.0 (with distributed support)
  • torch.distributed (NCCL backend)
  • torch.distributed.fsdp (FSDP)
  • torch.distributed.tensor (Tensor Parallelism, DTensor)
  • torch.distributed.checkpoint (DCP)
  • transformers >= 5.0
  • datasets >= 2.15.0
  • wandb (optional, for experiment tracking)

Credentials

  • WANDB_API_KEY: Weights & Biases API key (optional, for logging).
  • HF_TOKEN: HuggingFace API token (for loading gated models).

Quick Install

pip install transformers[torch] datasets wandb

# Launch with torchrun
TP_SIZE=2 DP_SIZE=2 torchrun --nproc_per_node=4 examples/3D_parallel.py
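For multi-node runs (see Compatibility Notes below), the same script is launched on every node with a shared rendezvous endpoint; a hedged sketch, where MASTER_HOST, NODE_RANK, and the port are placeholders to fill in for your cluster:

```shell
# Two nodes x 4 GPUs = world size 8, so TP_SIZE * DP_SIZE * CP_SIZE must equal 8.
# Run on each node, with NODE_RANK=0 on the first node and NODE_RANK=1 on the second.
TP_SIZE=2 DP_SIZE=2 CP_SIZE=2 torchrun \
  --nnodes=2 --nproc_per_node=4 --node_rank=$NODE_RANK \
  --rdzv_backend=c10d --rdzv_endpoint=$MASTER_HOST:29500 \
  examples/3D_parallel.py
```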

Code Evidence

NCCL initialization and world size assertion from examples/3D_parallel.py:90-99:

if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    assert world_size == tp_size * dp_size * cp_size, (
        f"World size ({world_size}) must equal TP size ({tp_size}) * DP size ({dp_size}) * CP size ({cp_size})"
    )

Environment variables for parallelism dimensions from examples/3D_parallel.py:75-77:

tp_size = int(os.environ.get("TP_SIZE", "1"))
dp_size = int(os.environ.get("DP_SIZE", "1"))
cp_size = int(os.environ.get("CP_SIZE", "1"))

DeviceMesh construction from examples/3D_parallel.py:101-102:

mesh = torch.arange(world_size).reshape(dp_size, tp_size, cp_size)
world_mesh = DeviceMesh(device_type="cuda", mesh=mesh, mesh_dim_names=("dp", "tp", "cp"))
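The row-major reshape above fixes how a global rank maps to mesh coordinates; the same arithmetic can be sketched in plain Python (the helper name is illustrative):

```python
def mesh_coords(rank: int, dp_size: int, tp_size: int, cp_size: int) -> tuple:
    """Invert torch.arange(world_size).reshape(dp_size, tp_size, cp_size):
    return the (dp, tp, cp) coordinates of a global rank in the mesh."""
    cp = rank % cp_size
    tp = (rank // cp_size) % tp_size
    dp = rank // (cp_size * tp_size)
    return (dp, tp, cp)

# With DP=2, TP=2, CP=2 (world size 8), rank 5 = 1*4 + 0*2 + 1:
assert mesh_coords(5, 2, 2, 2) == (1, 0, 1)
```

Ranks that share a dp coordinate form one FSDP group; ranks that share (dp, cp) form one TP group, which is why NVLink matters most along the tp dimension.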

Common Errors

Error Message                              | Cause                     | Solution
World size must equal TP * DP * CP         | GPU count mismatch        | Ensure --nproc_per_node equals TP_SIZE * DP_SIZE * CP_SIZE
NCCL error: unhandled cuda error           | GPU communication failure | Check NVLink/PCIe connectivity and CUDA driver version
RuntimeError: RANK not set                 | Not launched with torchrun | Use torchrun --nproc_per_node=N to launch
Global batch size not divisible by DP size | Batch/DP mismatch         | Set global batch size to a multiple of DP_SIZE
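The first and last errors in the table are pure configuration mistakes and can be caught before process-group init; a minimal preflight sketch (the function name is illustrative):

```python
def preflight(world_size, tp_size, dp_size, cp_size, global_batch_size):
    """Mirror the launch-time checks: mesh factorization and DP batch divisibility.
    Returns a list of human-readable error strings (empty means the config is valid)."""
    errors = []
    if world_size != tp_size * dp_size * cp_size:
        errors.append(
            f"World size ({world_size}) must equal "
            f"TP ({tp_size}) * DP ({dp_size}) * CP ({cp_size})"
        )
    if global_batch_size % dp_size != 0:
        errors.append(
            f"Global batch size ({global_batch_size}) must be "
            f"divisible by DP size ({dp_size})"
        )
    return errors

assert preflight(8, 2, 2, 2, 32) == []  # valid 3D config: 2*2*2 == 8, 32 % 2 == 0
```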

Compatibility Notes

  • Single GPU: Can run with TP=1, DP=1, CP=1 for debugging (use IGNORE_SANITY=1).
  • Multi-node: Requires --rdzv_endpoint for rendezvous coordination.
  • Context Parallelism: Requires SDPBackend.FLASH_ATTENTION for the SDPA kernel.
  • FSDP: Uses ShardingStrategy.NO_SHARD (DDP-like) in the example; full sharding also available.
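Context Parallelism splits the sequence dimension across the cp mesh dimension; the even-split arithmetic can be sketched in plain Python (the helper name is illustrative — the actual example relies on PyTorch's CP utilities, which also handle attention-aware load balancing):

```python
def cp_slice(seq_len: int, cp_size: int, cp_rank: int) -> tuple:
    """Contiguous token range held by one CP rank when a sequence is split
    evenly across cp_size ranks (assumes cp_size divides seq_len)."""
    assert seq_len % cp_size == 0, "sequence length must divide evenly across CP ranks"
    chunk = seq_len // cp_size
    start = cp_rank * chunk
    return (start, start + chunk)

# A 4096-token sequence on CP=2: rank 0 holds [0, 2048), rank 1 holds [2048, 4096)
assert cp_slice(4096, 2, 0) == (0, 2048)
assert cp_slice(4096, 2, 1) == (2048, 4096)
```

This is why the divisibility constraint on sequence length matters: each CP rank must receive an equal shard before the flash-attention kernel runs.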
