Principle:NVIDIA TransformerEngine Tensor Parallel Initialization

Overview

Setting up tensor-parallel and data-parallel process groups for distributed multi-GPU training.

Description

Tensor parallelism splits model parameters across GPUs within a group. The initialization step creates NCCL process groups for tensor-parallel communication (all-gather, reduce-scatter within the TP group) and optional data-parallel communication (gradient sync across DP replicas). Proper group setup is a prerequisite for all TP-aware TE modules.

The initialization process involves:

Creating a TP process group that contains the GPUs sharing model shards. Within this group, all-gather and reduce-scatter operations distribute activations and gradients for the split dimensions.
Creating a DP process group (optional) that contains GPUs holding the same model shard across different data replicas. This group handles gradient averaging across replicas.
Setting device affinity by assigning each rank to its corresponding local GPU via torch.cuda.set_device.
Passing group references to TE modules so they can perform the appropriate collective operations during forward and backward passes.
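
The four steps above can be sketched as follows. This is a minimal illustration, not TransformerEngine's own setup code: it assumes a torchrun-style launch that sets RANK, WORLD_SIZE and LOCAL_RANK, the helper names read_launch_env and init_tensor_parallel are hypothetical, and the layer sizes are arbitrary. The GPU-dependent imports sit inside the function so the environment helper can be used on a machine without CUDA.

```python
import os


def read_launch_env(env=os.environ):
    """Read rank information from torchrun-style environment variables."""
    return {
        "rank": int(env.get("RANK", 0)),
        "world_size": int(env.get("WORLD_SIZE", 1)),
        "local_rank": int(env.get("LOCAL_RANK", 0)),
    }


def init_tensor_parallel(tp_size):
    """Create TP and DP groups and build one TP-aware TE layer.

    Hypothetical helper: must run on every rank (e.g. under torchrun)
    and needs CUDA GPUs, NCCL and TransformerEngine installed.
    """
    # Lazy imports keep read_launch_env usable without a GPU stack.
    import torch
    import torch.distributed as dist
    import transformer_engine.pytorch as te

    info = read_launch_env()
    world_size = info["world_size"]
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")

    # Device affinity: bind this process to its local GPU first.
    torch.cuda.set_device(info["local_rank"])
    dist.init_process_group(backend="nccl")

    # TP groups are contiguous rank blocks; DP groups stride across them.
    tp_ranks = [list(range(s, s + tp_size)) for s in range(0, world_size, tp_size)]
    dp_ranks = [list(range(o, world_size, tp_size)) for o in range(tp_size)]
    tp_group, _ = dist.new_subgroups_by_enumeration(tp_ranks, backend="nccl")
    dp_group, _ = dist.new_subgroups_by_enumeration(dp_ranks, backend="nccl")

    # Hand the TP group to a TE module at construction time.
    layer = te.Linear(
        1024,  # in_features (illustrative)
        4096,  # out_features (illustrative)
        parallel_mode="column",
        tp_group=tp_group,
        tp_size=tp_size,
    )
    return tp_group, dp_group, layer
```

Each rank would call init_tensor_parallel(tp_size) once at startup and keep the returned groups for later module construction and gradient synchronization.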

Without correct group initialization, TE modules cannot perform tensor-parallel GEMM operations, as they depend on process groups for:

Gathering sharded activations before forward computation (the weight shards themselves stay local to each rank).
Scattering/reducing activation gradients during backward computation.
Synchronizing FP8 scaling metadata across the TP group.

Theoretical Basis

Given W (world_size) total GPUs and T (tp_size) GPUs per tensor-parallel group, where T divides W evenly:

There are W / T data-parallel replicas.
The TP group contains GPUs that collectively hold one complete copy of the model. For example, with W=8 and T=4, TP groups are [0,1,2,3] and [4,5,6,7].
The DP group contains GPUs that hold the same shard across different replicas. For the same example, DP groups are [0,4], [1,5], [2,6], [3,7].
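
The group layout described above can be reproduced with plain rank arithmetic. The sketch below uses hypothetical helper names (build_groups, group_of) and assumes the contiguous TP layout from the example; other layouts are possible.

```python
def build_groups(world_size, tp_size):
    """Return (tp_groups, dp_groups) as lists of rank lists."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    # Contiguous blocks of tp_size ranks form one TP group each.
    tp_groups = [
        list(range(start, start + tp_size))
        for start in range(0, world_size, tp_size)
    ]
    # Ranks with the same offset inside their TP group form one DP group.
    dp_groups = [
        list(range(offset, world_size, tp_size))
        for offset in range(tp_size)
    ]
    return tp_groups, dp_groups


def group_of(rank, groups):
    """Return the single group containing `rank`, or raise if not unique."""
    matches = [g for g in groups if rank in g]
    if len(matches) != 1:
        raise ValueError(f"rank {rank} belongs to {len(matches)} groups")
    return matches[0]


tp_groups, dp_groups = build_groups(world_size=8, tp_size=4)
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The position of a rank inside its TP group (rank % tp_size under this layout) is its TP rank, and the number of DP groups' members, world_size // tp_size, is the number of data-parallel replicas.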

Groups are defined by rank enumeration and created via torch.distributed.new_subgroups_by_enumeration or equivalent APIs. Each GPU participates in exactly one TP group and one DP group.

The TP rank within a group determines which shard of the model parameters that GPU is responsible for. For column-parallel linear layers, each GPU holds out_features / tp_size of the output columns; for row-parallel layers, each holds in_features / tp_size of the input rows. Both equal hidden_size / tp_size when the layer maps hidden_size to hidden_size.
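
As a small worked example of the shard arithmetic, the sketch below computes which slice of a dimension a given TP rank owns and the resulting local weight shape. The helper names are hypothetical, and the shapes assume the PyTorch convention of storing a linear weight as [out_features, in_features], so "column-parallel" splits the first stored dimension and "row-parallel" splits the second.

```python
def shard_range(size, tp_size, tp_rank):
    """Return the index range of `size` owned by `tp_rank`."""
    if size % tp_size != 0:
        raise ValueError("dimension must be divisible by tp_size")
    if not 0 <= tp_rank < tp_size:
        raise ValueError("tp_rank out of range")
    per_rank = size // tp_size
    return range(per_rank * tp_rank, per_rank * (tp_rank + 1))


def local_weight_shape(in_features, out_features, tp_size, parallel_mode):
    """Return the per-rank weight shape as (out, in)."""
    if parallel_mode == "column":
        # Output features are split; every rank sees the full input.
        return (len(shard_range(out_features, tp_size, 0)), in_features)
    if parallel_mode == "row":
        # Input features are split; every rank produces a partial output.
        return (out_features, len(shard_range(in_features, tp_size, 0)))
    raise ValueError("parallel_mode must be 'column' or 'row'")


# TP rank 1 of 4 on a 4096-wide dimension owns indices 1024..2047.
r = shard_range(4096, tp_size=4, tp_rank=1)
print(r.start, r.stop)  # 1024 2048
print(local_weight_shape(1024, 4096, 4, "column"))  # (1024, 1024)
print(local_weight_shape(4096, 1024, 4, "row"))     # (1024, 1024)
```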

Usage

Use this as the first step when setting up tensor-parallel training with TransformerEngine. The initialization must happen before:

Constructing any TE modules (they require tp_group and tp_size at construction time).
Initializing userbuffers for comm-GEMM overlap.
Setting up RNG state tracking for model-parallel dropout.

The typical initialization sequence is:

Initialize torch.distributed with NCCL backend.
Create TP and DP process groups.
Set up CudaRNGStatesTracker with rank-dependent seeds.
Construct TE model with tp_group and tp_size.
Optionally initialize userbuffers for comm-GEMM overlap.
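
Step 3 of the sequence above, rank-dependent seeding, might look like the sketch below. The seeding scheme is an assumption made for illustration: ranks in the same TP group get different "model-parallel" seeds so dropout on sharded activations differs per shard, while the default seed stays identical. The offset value and the helper names rank_dependent_seeds and make_rng_tracker are hypothetical; the tracker name "model-parallel-rng" is chosen to match the default name that CudaRNGStatesTracker.fork uses.

```python
def rank_dependent_seeds(base_seed, tp_rank, offset=2718):
    """Derive per-rank seeds; `offset` is an arbitrary illustrative constant."""
    return {
        # Shared across the TP group: used for non-sharded regions.
        "default": base_seed,
        # Distinct per TP rank: used for dropout on sharded activations.
        "model_parallel": base_seed + offset + tp_rank,
    }


def make_rng_tracker(base_seed, tp_rank):
    """Build a CudaRNGStatesTracker seeded for this TP rank (needs CUDA)."""
    # Lazy imports so rank_dependent_seeds works without a GPU stack.
    import torch
    from transformer_engine.pytorch.distributed import CudaRNGStatesTracker

    seeds = rank_dependent_seeds(base_seed, tp_rank)
    torch.cuda.manual_seed(seeds["default"])
    tracker = CudaRNGStatesTracker()
    tracker.add("model-parallel-rng", seeds["model_parallel"])
    return tracker
```

TE modules that accept a get_rng_state_tracker argument expect a callable returning the tracker, so a closure such as lambda: tracker can be passed when constructing the model in step 4.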

Sources

TransformerEngine
