Environment:Huggingface Transformers 3D Parallel Multi GPU
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Infrastructure, GPU |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
Multi-GPU environment with NCCL backend for Tensor Parallelism, FSDP Data Parallelism, and Context Parallelism training.
Description
This environment provides the infrastructure for 3D parallel distributed training that combines Tensor Parallelism (TP), Fully Sharded Data Parallelism (FSDP), and Context Parallelism (CP). It requires multiple NVIDIA GPUs connected via NVLink or PCIe, the NCCL communication backend, and PyTorch distributed (torchrun). The environment uses DeviceMesh to organize the GPUs into a three-dimensional (DP, TP, CP) grid.
Usage
Required for the 3D Parallel Distributed Training workflow. Use this when training models too large for a single GPU, or when you need to scale training across multiple GPUs/nodes for throughput.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | NCCL requires Linux |
| Hardware | Multiple NVIDIA GPUs | world_size = TP_SIZE x DP_SIZE x CP_SIZE |
| VRAM | >= 16GB per GPU | A100 40GB/80GB recommended |
| Interconnect | NVLink or PCIe | NVLink strongly recommended for TP |
| CUDA | 11.8+ or 12.x | Must support NCCL |
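As the hardware row above notes, the total process count must equal TP_SIZE x DP_SIZE x CP_SIZE. A minimal pre-flight sketch of that check for a single node (the helper name and the use of torch.cuda.device_count() are illustrative assumptions, not part of the example script):
import os
import torch

def check_parallel_dims():
    # Hypothetical check: the parallelism dims must multiply to the process count
    tp = int(os.environ.get("TP_SIZE", "1"))
    dp = int(os.environ.get("DP_SIZE", "1"))
    cp = int(os.environ.get("CP_SIZE", "1"))
    expected = tp * dp * cp
    gpus = torch.cuda.device_count()  # on one node, --nproc_per_node should not exceed this
    if gpus < expected:
        raise RuntimeError(
            f"TP*DP*CP = {expected} but only {gpus} GPUs are visible; "
            "adjust TP_SIZE/DP_SIZE/CP_SIZE or --nproc_per_node"
        )

check_parallel_dims()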
Dependencies
System Packages
- NVIDIA CUDA Toolkit 11.8+ or 12.x
- NCCL 2.x (usually bundled with PyTorch)
- torchrun (from PyTorch distributed)
Python Packages
- torch >= 2.4.0 (with distributed support)
- torch.distributed (NCCL backend)
- torch.distributed.fsdp (FSDP)
- torch.distributed.tensor (Tensor Parallelism, DTensor)
- torch.distributed.checkpoint (DCP)
- transformers >= 5.0
- datasets >= 2.15.0
- wandb (optional, for experiment tracking)
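A quick sanity-check sketch that the installed stack meets these requirements (illustrative only; the expected versions follow the list above):
import torch
import torch.distributed as dist
import transformers

print("torch:", torch.__version__)                # expect >= 2.4.0
print("transformers:", transformers.__version__)  # expect >= 5.0
print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", dist.is_nccl_available())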
Credentials
- WANDB_API_KEY: Weights & Biases API key (optional, for logging).
- HF_TOKEN: HuggingFace API token (for loading gated models).
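Both are normally exported as environment variables before launching. A hedged sketch of picking them up at startup (the explicit login calls are one convenient option, not something the example script requires; huggingface_hub is pulled in as a transformers dependency):
import os
import wandb
from huggingface_hub import login

if os.environ.get("HF_TOKEN"):
    login(token=os.environ["HF_TOKEN"])           # needed only for gated models
if os.environ.get("WANDB_API_KEY"):
    wandb.login(key=os.environ["WANDB_API_KEY"])  # optional experiment tracking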
Quick Install
pip install transformers[torch] datasets wandb
# Launch with torchrun
TP_SIZE=2 DP_SIZE=2 torchrun --nproc_per_node=4 examples/3D_parallel.py
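A hedged multi-node variant of the same launch (host, port, and the 2 x 8 GPU layout are placeholders; run the command on every node):
# Example: 2 nodes x 8 GPUs = 16 processes (TP=8, DP=2, CP=1)
TP_SIZE=8 DP_SIZE=2 CP_SIZE=1 torchrun \
  --nnodes=2 --nproc_per_node=8 \
  --rdzv_backend=c10d --rdzv_endpoint=<master-host>:29500 \
  examples/3D_parallel.py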
Code Evidence
NCCL initialization and world size assertion from examples/3D_parallel.py:90-99:
if "RANK" in os.environ and "WORLD_SIZE" in os.environ:
dist.init_process_group("nccl")
rank = dist.get_rank()
world_size = dist.get_world_size()
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
assert world_size == tp_size * dp_size * cp_size, (
f"World size ({world_size}) must equal TP size ({tp_size}) * DP size ({dp_size}) * CP size ({cp_size})"
)
Environment variables for parallelism dimensions from examples/3D_parallel.py:75-77:
tp_size = int(os.environ.get("TP_SIZE", "1"))
dp_size = int(os.environ.get("DP_SIZE", "1"))
cp_size = int(os.environ.get("CP_SIZE", "1"))
DeviceMesh construction from examples/3D_parallel.py:101-102:
mesh = torch.arange(world_size).reshape(dp_size, tp_size, cp_size)
world_mesh = DeviceMesh(device_type="cuda", mesh=mesh, mesh_dim_names=("dp", "tp", "cp"))
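Each dimension can then be addressed by name when sharding the model; a short sketch of how sub-meshes are typically sliced from this mesh (the variable names are illustrative, not quoted from the script):
# DeviceMesh supports name-based indexing into sub-meshes
tp_mesh = world_mesh["tp"]   # tensor-parallel sharding of layers
dp_mesh = world_mesh["dp"]   # FSDP / data-parallel gradient sync
cp_mesh = world_mesh["cp"]   # context-parallel sharding of the sequence dimension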
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| World size must equal TP * DP * CP | GPU count mismatch | Ensure --nproc_per_node equals TP_SIZE * DP_SIZE * CP_SIZE |
| NCCL error: unhandled cuda error | GPU communication failure | Check NVLink/PCIe connectivity and CUDA driver version |
| RuntimeError: RANK not set | Not launched with torchrun | Use torchrun --nproc_per_node=N to launch |
| Global batch size not divisible by DP size | Batch/DP mismatch | Set global batch size to a multiple of DP_SIZE |
Compatibility Notes
- Single GPU: Can run with TP=1, DP=1, CP=1 for debugging (use IGNORE_SANITY=1).
- Multi-node: Requires --rdzv_endpoint for rendezvous coordination.
- Context Parallelism: Requires SDPBackend.FLASH_ATTENTION for the SDPA kernel (see the sketch after this list).
- FSDP: Uses ShardingStrategy.NO_SHARD (DDP-like) in the example; full sharding is also available.
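For the Context Parallelism note above, the training step is typically run under the flash-attention SDPA kernel; a minimal sketch, assuming a model and batch are already set up:
from torch.nn.attention import SDPBackend, sdpa_kernel

# Restrict scaled_dot_product_attention to the flash-attention kernel,
# which context parallelism requires.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    outputs = model(**batch)   # placeholder forward pass
    outputs.loss.backward()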
Related Pages
- Implementation:Huggingface_Transformers_Init_Process_Group
- Implementation:Huggingface_Transformers_DeviceMesh_Construction
- Implementation:Huggingface_Transformers_AutoModelForCausalLM_From_Pretrained_For_TP
- Implementation:Huggingface_Transformers_FSDP_Wrapping
- Implementation:Huggingface_Transformers_Context_Parallel_Training_Loop
- Implementation:Huggingface_Transformers_All_Reduce_Grads
- Implementation:Huggingface_Transformers_DCP_Save