Principle:NVIDIA TransformerEngine Tensor Parallel Initialization
Overview
Setting up tensor-parallel and data-parallel process groups for distributed multi-GPU training.
Description
Tensor parallelism splits model parameters across GPUs within a group. The initialization step creates NCCL process groups for tensor-parallel communication (all-gather, reduce-scatter within the TP group) and optional data-parallel communication (gradient sync across DP replicas). Proper group setup is a prerequisite for all TP-aware TE modules.
The initialization process involves:
- Creating a TP process group that contains the GPUs sharing model shards. Within this group, all-gather and reduce-scatter operations distribute activations and gradients for the split dimensions.
- Creating a DP process group (optional) that contains GPUs holding the same model shard across different data replicas. This group handles gradient averaging across replicas.
- Setting device affinity by assigning each rank to its corresponding local GPU via torch.cuda.set_device.
- Passing group references to TE modules so they can perform the appropriate collective operations during forward and backward passes.
Without correct group initialization, TE modules cannot perform tensor-parallel GEMM operations, as they depend on process groups for:
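- Gathering sharded input activations before forward computation (when sequence parallelism is enabled). The weight shards themselves stay local to each GPU.
- Reducing or reduce-scattering partial outputs in the forward pass and activation gradients in the backward pass.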
- Synchronizing FP8 scaling metadata across the TP group.
Theoretical Basis
Given W (world_size) total GPUs and T (tp_size) GPUs per tensor-parallel group:
- There are W / T data-parallel replicas.
- The TP group contains GPUs that collectively hold one complete copy of the model. For example, with W=8 and T=4, TP groups are [0,1,2,3] and [4,5,6,7].
- The DP group contains GPUs that hold the same shard across different replicas. For the same example, DP groups are [0,4], [1,5], [2,6], [3,7].
Groups are defined by rank enumeration and created via torch.distributed.new_subgroups_by_enumeration or equivalent APIs. Each GPU participates in exactly one TP group and one DP group.
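The rank enumeration described above can be sketched in plain Python. This is a minimal illustration that assumes contiguous ranks form a TP group, as in the W=8, T=4 example; the function name enumerate_groups is hypothetical and not part of TransformerEngine or PyTorch.

```python
from typing import List, Tuple


def enumerate_groups(world_size: int, tp_size: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (tp_groups, dp_groups) as lists of global ranks.

    Assumes contiguous ranks share a TP group, matching the example in the text.
    """
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    # Each TP group is a block of tp_size consecutive ranks.
    tp_groups = [list(range(start, start + tp_size))
                 for start in range(0, world_size, tp_size)]
    # Each DP group takes the rank at the same offset from every TP group.
    dp_groups = [list(range(offset, world_size, tp_size))
                 for offset in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = enumerate_groups(world_size=8, tp_size=4)
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Rank lists in this form are the kind of input that torch.distributed.new_subgroups_by_enumeration expects, or they can be passed one at a time to torch.distributed.new_group.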
The TP rank within a group determines which shard of the model parameters that GPU is responsible for. For column-parallel linear layers, each GPU holds hidden_size / tp_size output columns; for row-parallel layers, each holds hidden_size / tp_size input rows.
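As a worked example of the shard sizes, the sketch below computes the local weight shape held by each GPU. It assumes the [out_features, in_features] weight layout used by torch.nn.Linear, so the "output columns" of a column-parallel layer correspond to the first dimension of the stored tensor; the helper name local_weight_shape is hypothetical.

```python
from typing import Tuple


def local_weight_shape(out_features: int, in_features: int,
                       tp_size: int, parallel_mode: str) -> Tuple[int, int]:
    """Shape of the weight shard on one GPU, in [out_features, in_features] layout."""
    if parallel_mode == "column":
        # Column-parallel: the output dimension is split across the TP group.
        if out_features % tp_size != 0:
            raise ValueError("out_features must be divisible by tp_size")
        return (out_features // tp_size, in_features)
    if parallel_mode == "row":
        # Row-parallel: the input dimension is split across the TP group.
        if in_features % tp_size != 0:
            raise ValueError("in_features must be divisible by tp_size")
        return (out_features, in_features // tp_size)
    raise ValueError("parallel_mode must be 'column' or 'row'")


# hidden_size = 4096 and tp_size = 4 give 1024 columns or rows per GPU.
print(local_weight_shape(4096, 4096, 4, "column"))  # (1024, 4096)
print(local_weight_shape(4096, 4096, 4, "row"))     # (4096, 1024)
```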
Usage
Use as the first step when setting up tensor-parallel training with TransformerEngine. The initialization must happen before:
- Constructing any TE modules (they require tp_group and tp_size at construction time).
- Initializing userbuffers for comm-GEMM overlap.
- Setting up RNG state tracking for model-parallel dropout.
The typical initialization sequence is:
- Initialize torch.distributed with NCCL backend.
- Create TP and DP process groups.
- Set up CudaRNGStatesTracker with rank-dependent seeds.
- Construct TE model with tp_group and tp_size.
- Optionally initialize userbuffers for comm-GEMM overlap.
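The sequence above might look roughly like the following sketch. It rests on several assumptions that are not stated in this page: the job is launched with torchrun (which sets LOCAL_RANK), contiguous ranks form a TP group, the seed convention base_seed + tp_rank is purely illustrative, and "model-parallel-rng" is used as the tracker state name. The function names setup_tensor_parallel and model_parallel_seed and the layer sizes are hypothetical. Userbuffers initialization is omitted.

```python
import os


def model_parallel_seed(base_seed: int, tp_rank: int) -> int:
    """Illustrative convention (an assumption): offset the seed by the TP rank."""
    return base_seed + tp_rank


def setup_tensor_parallel(tp_size: int, hidden_size: int = 1024, base_seed: int = 1234):
    # Heavy imports are deferred so this file can be imported without a GPU stack.
    import torch
    import torch.distributed as dist
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch.distributed import CudaRNGStatesTracker

    # Step 1: pin this process to its local GPU and initialize NCCL.
    local_rank = int(os.environ["LOCAL_RANK"])  # assumed to be set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")

    # Step 2: create TP and DP groups. Every rank calls new_group for every
    # group, in the same order, including groups it does not belong to.
    tp_group = dp_group = None
    for start in range(0, world_size, tp_size):
        ranks = list(range(start, start + tp_size))
        group = dist.new_group(ranks=ranks, backend="nccl")
        if rank in ranks:
            tp_group = group
    for offset in range(tp_size):
        ranks = list(range(offset, world_size, tp_size))
        group = dist.new_group(ranks=ranks, backend="nccl")
        if rank in ranks:
            dp_group = group

    # Step 3: register a rank-dependent RNG state for model-parallel regions.
    tp_rank = dist.get_rank(group=tp_group)
    tracker = CudaRNGStatesTracker()
    tracker.add("model-parallel-rng", model_parallel_seed(base_seed, tp_rank))

    # Step 4: construct a TP-aware TE module with the group and its size.
    layer = te.Linear(
        hidden_size,
        hidden_size,
        tp_group=tp_group,
        tp_size=tp_size,
        parallel_mode="column",
        get_rng_state_tracker=lambda: tracker,
    )
    return layer, tp_group, dp_group
```

Under these assumptions, a launch such as torchrun --nproc_per_node=8 with a script that calls setup_tensor_parallel(tp_size=4) would give each rank one TP group of four GPUs and one DP group of two.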