Principle:NVIDIA TransformerEngine Tensor Parallel Initialization

From Leeroopedia


Overview

Setting up tensor-parallel and data-parallel process groups for distributed multi-GPU training.

Description

Tensor parallelism (TP) splits model parameters across the GPUs within a group. The initialization step creates NCCL process groups for tensor-parallel communication (all-gather and reduce-scatter within the TP group) and, optionally, for data-parallel (DP) communication (gradient synchronization across DP replicas). Proper group setup is a prerequisite for all TP-aware TransformerEngine (TE) modules.

The initialization process involves:

  • Creating a TP process group that contains the GPUs sharing model shards. Within this group, all-gather and reduce-scatter operations distribute activations and gradients for the split dimensions.
  • Creating a DP process group (optional) that contains GPUs holding the same model shard across different data replicas. This group handles gradient averaging across replicas.
  • Setting device affinity by assigning each rank to its corresponding local GPU via torch.cuda.set_device.
  • Passing group references to TE modules so they can perform the appropriate collective operations during forward and backward passes.

Without correct group initialization, TE modules cannot perform tensor-parallel GEMM operations, as they depend on process groups for:

  • Gathering sharded input activations before the forward GEMM when sequence parallelism is enabled (the weight shards themselves stay local to each rank).
  • Scattering/reducing activation gradients during backward computation.
  • Synchronizing FP8 scaling metadata across the TP group.

Theoretical Basis

Given W (world_size) total GPUs and T (tp_size) GPUs per tensor-parallel group, with W divisible by T:

  • There are W / T data-parallel replicas.
  • The TP group contains GPUs that collectively hold one complete copy of the model. For example, with W=8 and T=4, TP groups are [0,1,2,3] and [4,5,6,7].
  • The DP group contains GPUs that hold the same shard across different replicas. For the same example, DP groups are [0,4], [1,5], [2,6], [3,7].
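The rank layout above can be sketched in plain Python. This is a minimal illustration that assumes consecutive global ranks form a TP group, as in the W=8, T=4 example; the function name enumerate_groups is hypothetical and not part of any library.

```python
def enumerate_groups(world_size: int, tp_size: int):
    """Return (tp_groups, dp_groups) as lists of global ranks."""
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")
    dp_size = world_size // tp_size  # number of data-parallel replicas (W / T)

    # Consecutive ranks share one model replica, so they form one TP group.
    tp_groups = [list(range(r * tp_size, (r + 1) * tp_size)) for r in range(dp_size)]

    # Ranks at the same offset inside their TP group hold the same shard,
    # so they form one DP group.
    dp_groups = [list(range(offset, world_size, tp_size)) for offset in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = enumerate_groups(world_size=8, tp_size=4)
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every rank appears once in tp_groups and once in dp_groups, which matches the requirement that each GPU belongs to exactly one group of each kind.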

Groups are defined by rank enumeration and created via torch.distributed.new_subgroups_by_enumeration or equivalent APIs. Each GPU participates in exactly one TP group and one DP group.
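As a rough sketch of how the groups might be created with torch.distributed.new_subgroups_by_enumeration, the function below initializes NCCL, pins the rank to its GPU, and builds both groups. It assumes the job is launched with torchrun (which sets RANK, WORLD_SIZE and LOCAL_RANK) and that consecutive ranks form a TP group; the function name init_tp_dp_groups is hypothetical.

```python
import os


def init_tp_dp_groups(tp_size: int):
    """Initialize NCCL, pin this rank to its GPU, and create TP/DP groups."""
    # Imported lazily so this file can be imported on machines without PyTorch.
    import torch
    import torch.distributed as dist

    # Assumption: torchrun provides the rendezvous environment variables.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    world_size = dist.get_world_size()
    if world_size % tp_size != 0:
        raise ValueError("world_size must be divisible by tp_size")

    # Rank enumeration: consecutive ranks per TP group, strided ranks per DP group.
    tp_lists = [list(range(s, s + tp_size)) for s in range(0, world_size, tp_size)]
    dp_lists = [list(range(o, world_size, tp_size)) for o in range(tp_size)]

    # Every rank must make both calls. Each call returns a tuple of
    # (the subgroup containing this rank, the list of all subgroups).
    tp_group, _ = dist.new_subgroups_by_enumeration(tp_lists)
    dp_group, _ = dist.new_subgroups_by_enumeration(dp_lists)
    return tp_group, dp_group
```

A launch along the lines of `torchrun --nproc_per_node=8 train.py` would then call init_tp_dp_groups(tp_size=4) once per process, before any TE module is constructed.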

The TP rank within a group determines which shard of the model parameters that GPU is responsible for. For column-parallel linear layers, each GPU holds out_features / tp_size output columns; for row-parallel layers, each holds in_features / tp_size input rows. For a square hidden_size × hidden_size layer, both equal hidden_size / tp_size.
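The shard arithmetic can be illustrated with a small sketch. It assumes the PyTorch weight layout [out_features, in_features] and contiguous, equally sized shards ordered by TP rank; both helper names are hypothetical.

```python
def shard_bounds(features: int, tp_size: int, tp_rank: int):
    """Half-open index range [start, stop) owned by tp_rank along the split dimension."""
    if features % tp_size != 0:
        raise ValueError("split dimension must be divisible by tp_size")
    per_rank = features // tp_size
    start = tp_rank * per_rank
    return start, start + per_rank


def local_weight_shape(out_features: int, in_features: int, tp_size: int, parallel_mode: str):
    """Per-rank weight shape, given a full weight of shape [out_features, in_features]."""
    if parallel_mode == "column":
        # Column-parallel: the output dimension is split across the TP group.
        lo, hi = shard_bounds(out_features, tp_size, 0)
        return (hi - lo, in_features)
    if parallel_mode == "row":
        # Row-parallel: the input dimension is split across the TP group.
        lo, hi = shard_bounds(in_features, tp_size, 0)
        return (out_features, hi - lo)
    raise ValueError("parallel_mode must be 'column' or 'row'")


# Example: an MLP with hidden size 4096 and intermediate size 16384, tp_size=4.
print(local_weight_shape(16384, 4096, 4, "column"))  # (4096, 4096)
print(local_weight_shape(4096, 16384, 4, "row"))     # (4096, 4096)
print(shard_bounds(16384, 4, 1))                     # (4096, 8192)
```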

Usage

Use this as the first step when setting up tensor-parallel training with TransformerEngine. The initialization must happen before:

  • Constructing any TE modules (they take tp_size at construction time to size their weight shards, and tp_group either at construction time or before the first forward pass).
  • Initializing userbuffers for comm-GEMM overlap.
  • Setting up RNG state tracking for model-parallel dropout.

The typical initialization sequence is:

  1. Initialize torch.distributed with NCCL backend.
  2. Create TP and DP process groups.
  3. Set up CudaRNGStatesTracker with rank-dependent seeds.
  4. Construct TE model with tp_group and tp_size.
  5. Optionally initialize userbuffers for comm-GEMM overlap.
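Steps 3 and 4 of the sequence above might look roughly like the sketch below, which takes an already created tp_group as input. It assumes that CudaRNGStatesTracker is importable from transformer_engine.pytorch.distributed and that te.Linear accepts parallel_mode, tp_group, tp_size and get_rng_state_tracker; check these against the installed TE version. The seed scheme (base seed plus TP rank) and the function names are illustrative assumptions, and step 5 is omitted.

```python
def model_parallel_seed(base_seed: int, tp_rank: int) -> int:
    """Hypothetical seed scheme: a distinct seed for every TP rank."""
    return base_seed + tp_rank


def build_tp_linear(tp_group, tp_size: int, hidden_size: int = 1024, base_seed: int = 1234):
    """Create an RNG tracker and one column-parallel TE layer (requires GPUs and TE)."""
    # Imported lazily so this file can be imported without PyTorch or TE installed.
    import torch.distributed as dist
    import transformer_engine.pytorch as te
    from transformer_engine.pytorch.distributed import CudaRNGStatesTracker

    # Step 3: rank-dependent seed so each TP rank draws different random numbers.
    tp_rank = dist.get_rank(group=tp_group)
    tracker = CudaRNGStatesTracker()
    tracker.add("model-parallel-rng", model_parallel_seed(base_seed, tp_rank))

    # Step 4: construct the module with the TP group and size.
    layer = te.Linear(
        hidden_size,
        4 * hidden_size,
        parallel_mode="column",
        tp_group=tp_group,
        tp_size=tp_size,
        get_rng_state_tracker=lambda: tracker,
    )
    return layer, tracker
```

Using different seeds per TP rank keeps dropout masks on sharded activations from being identical across the group, which is the purpose of the rank-dependent seeding in step 3.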
