
Principle:NVIDIA TransformerEngine Comm GEMM Overlap

From Leeroopedia


Overview

Overlapping NCCL communication (all-gather, reduce-scatter) with GEMM computation for maximum GPU utilization during tensor-parallel training.

Description

In tensor-parallel training, communication (the all-gather before a linear layer's GEMM and the reduce-scatter after it) and the GEMM computation itself typically run sequentially. Userbuffers enable overlapping these operations: IPC-based shared memory buffers let NCCL communication proceed concurrently with cuBLAS GEMM execution on different GPU resources (copy engines vs. compute SMs).

The sequential execution pattern in standard tensor-parallel training is:

  1. All-gather the sharded input across the TP group.
  2. Compute the GEMM (matrix multiplication) on the gathered input.
  3. Reduce-scatter the output back across the TP group.
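The three steps above can be mimicked in a minimal single-process sketch, with NumPy array operations standing in for the NCCL collectives (the shapes and tp_size here are illustrative, not TE defaults):

```python
import numpy as np

# Toy single-process model of the sequential TP pattern. "Communication"
# is simulated with NumPy array ops; in real training these are NCCL calls.
tp_size = 4
rng = np.random.default_rng(0)
seq, hidden = 32, 16

# The input is sharded along the sequence dimension across the TP group.
full_input = rng.standard_normal((seq, hidden))
input_shards = np.split(full_input, tp_size, axis=0)
weight = rng.standard_normal((hidden, hidden))

# Step 1: all-gather the sharded input (modeled as concatenation).
gathered = np.concatenate(input_shards, axis=0)

# Step 2: GEMM on the gathered input.
out = gathered @ weight

# Step 3: reduce-scatter returns one output shard per rank
# (modeled here as a split along the sequence dimension).
out_shard = np.split(out, tp_size, axis=0)[0]
```

Each simulated step finishes before the next begins, which is exactly the serialization that comm-GEMM overlap removes.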

Each step waits for the previous one to complete, leaving either the compute units or the communication hardware idle at any given time. With comm-GEMM overlap:

  • The all-gather and GEMM are pipelined: as chunks of data arrive from the all-gather, GEMM computation begins on the already-arrived chunks.
  • The reduce-scatter and the next GEMM can similarly be overlapped.
  • Userbuffers provide the shared memory infrastructure that enables this pipelining without requiring explicit synchronization between communication and compute streams.
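The numerical property that makes this pipelining safe can be checked with a small sketch: applying the GEMM chunk-by-chunk, as chunks "arrive", yields exactly the monolithic result (NumPy stand-in, illustrative shapes; real overlap happens on GPU hardware, not shown here):

```python
import numpy as np

# Chunked-execution sketch: instead of waiting for the full all-gather,
# run the GEMM on each chunk as it "arrives". On real hardware the
# all-gather of chunk i+1 and the GEMM on chunk i run concurrently;
# here we only verify the numerics of chunked execution.
rng = np.random.default_rng(0)
num_chunks = 4
chunks = [rng.standard_normal((8, 16)) for _ in range(num_chunks)]
weight = rng.standard_normal((16, 16))

outputs = []
for chunk in chunks:                # chunk i+1 would be communicated here
    outputs.append(chunk @ weight)  # while chunk i is being multiplied
pipelined = np.concatenate(outputs, axis=0)

# Chunked execution matches the monolithic GEMM exactly.
monolithic = np.concatenate(chunks, axis=0) @ weight
```

Because matrix multiplication is row-wise independent in the gathered dimension, splitting into chunks changes only the schedule, not the result.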

This technique is particularly effective when:

  • The tensor-parallel group spans multiple nodes (higher communication latency).
  • The communication and compute times are roughly balanced.
  • The model's linear layers are large enough to benefit from chunked execution.

Theoretical Basis

GPUs have independent compute units (Streaming Multiprocessors / SMs) and copy engines (DMA engines). By scheduling NCCL communication on copy engines while GEMM runs on SMs, the total latency approaches:

total_time ≈ max(comm_time, compute_time)

instead of the sequential:

total_time = comm_time + compute_time
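Plugging made-up timings into the two formulas shows the size of the win (the numbers below are purely illustrative, not measured TE results):

```python
# Hypothetical per-layer timings in milliseconds, for arithmetic only.
comm_time, compute_time = 1.2, 1.5

sequential = comm_time + compute_time      # 2.7 ms: steps run back to back
overlapped = max(comm_time, compute_time)  # 1.5 ms: comm hides under compute
speedup = sequential / overlapped          # 1.8x for this layer
```

The benefit is largest when the two times are balanced; if one dominates, the max is close to the sum of the smaller term plus the larger, and overlap saves little.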

The implementation uses IPC (Inter-Process Communication) shared memory buffers (Userbuffers) that are accessible by both the NCCL communication library and the cuBLAS GEMM kernels. This avoids the overhead of copying data between separate communication and compute buffers.

The overlap is achieved through:

  • Chunked execution: The input tensor is divided into chunks. Each chunk can be communicated and computed independently.
  • Stream-based pipelining: Communication and compute are scheduled on different CUDA streams, allowing hardware-level concurrency.
  • Double buffering: Two sets of buffers are used so that communication for the next chunk can proceed while computation on the current chunk is in progress.
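A CPU-side analogue of the stream pipelining and double buffering above can be sketched with a worker thread standing in for the communication stream (names and shapes are illustrative; real TE uses CUDA streams and copy engines, not Python threads):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# A background thread plays the role of the communication stream,
# "fetching" the next chunk while the main thread plays the compute
# stream and multiplies the current chunk: double buffering in miniature.
rng = np.random.default_rng(0)
chunks = [rng.standard_normal((8, 16)) for _ in range(4)]
weight = rng.standard_normal((16, 16))

def fetch(i):
    # Stand-in for the per-chunk all-gather: copy into a fresh buffer.
    return chunks[i].copy()

outputs = []
with ThreadPoolExecutor(max_workers=1) as comm_stream:
    pending = comm_stream.submit(fetch, 0)              # prefetch chunk 0
    for i in range(len(chunks)):
        current = pending.result()                      # wait for "comm" of chunk i
        if i + 1 < len(chunks):
            pending = comm_stream.submit(fetch, i + 1)  # overlap the next fetch
        outputs.append(current @ weight)                # "compute" chunk i

result = np.concatenate(outputs, axis=0)
```

At any moment at most two buffers are live (the chunk being computed and the chunk being fetched), which is why two buffer sets suffice.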

Usage

Use in tensor-parallel training when communication overhead is significant. Common scenarios include:

  • Large TP groups (4+ GPUs) where all-gather and reduce-scatter latency is non-trivial.
  • Multi-node TP where inter-node bandwidth is limited.
  • Bandwidth-limited interconnects where communication time approaches or exceeds compute time.

The overlap is set up once at initialization via initialize_ub and is then automatically used by TE's linear modules (te.Linear, te.LayerNormLinear) during forward and backward passes.
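A setup sketch for TE's PyTorch API follows. The exact signatures of initialize_ub and the ub_* flags on te.Linear vary across TransformerEngine releases, so treat every argument name here as an assumption to be checked against the installed version's documentation; the code also assumes a multi-GPU job launched with one process per GPU (e.g. via torchrun) and will not run elsewhere.

```python
# Sketch only: argument names follow recent TE releases and may differ
# in your version. Requires a multi-GPU NCCL job, one process per GPU.
import torch
import transformer_engine.pytorch as te

torch.distributed.init_process_group(backend="nccl")
tp_size = torch.distributed.get_world_size()

seq_len, batch, hidden = 2048, 2, 4096  # illustrative sizes

# One-time setup: allocate the IPC-shared Userbuffers used for the
# comm-GEMM overlap, sized to the layer's activation shape.
te.initialize_ub(
    shape=[seq_len * batch, hidden],
    tp_size=tp_size,
    use_fp8=False,
    dtype=torch.bfloat16,
)

# TE linear modules then opt into the overlap via their ub_* flags;
# ub_name identifies the layer position (e.g. "qkv", "proj", "fc1", "fc2").
linear = te.Linear(
    hidden, hidden,
    tp_group=torch.distributed.group.WORLD,
    parallel_mode="column",
    sequence_parallel=True,
    ub_overlap_ag=True,   # overlap the all-gather with this GEMM
    ub_name="qkv",
)
```

After this initialization, forward and backward passes through the module use the Userbuffers path automatically; no per-step code changes are needed.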
