Principle:Deepspeedai DeepSpeed Tensor Parallel Training
Overview
Training with tensor-parallel layers that automatically handle AllReduce communication in the forward and backward passes, enabling distributed computation over partitioned weight matrices.
Detailed Description
After AutoTP replaces linear layers, the training loop proceeds normally via engine.backward() and engine.step(). The TP layers handle communication transparently -- the user's training code does not need to manage any distributed communication.
The two primary TP layer types behave differently during forward and backward:
LinearAllreduce (row-parallel):
- Forward: Computes output = input @ weight.T, then applies RowParallel, which performs an AllReduce across TP ranks to sum the partial results. In training mode, the AllReduce is performed with a gradient-aware autograd function (not in-place) to ensure correct backpropagation.
- Backward: Gradients flow back through the AllReduce operation automatically via PyTorch autograd.
LinearLayer (column-parallel):
- Forward: Applies ColumnParallel, which is an identity operation in forward (or optionally an AllGather in overlapped mode via AsyncColumnParallel), then computes output = input @ weight.T. The output is a partition of the full output.
- Backward: The ColumnParallel autograd function handles the AllReduce of gradients in the backward pass.
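The forward arithmetic of both layer types can be checked on a single process. The sketch below is illustrative only (not DeepSpeed code): torch.chunk shards stand in for per-rank partitions, and an explicit sum stands in for the AllReduce.

```python
import torch

torch.manual_seed(0)
tp_size, batch, d_in, d_out = 2, 4, 8, 6

# Full (un-partitioned) reference computation: output = input @ weight.T
x = torch.randn(batch, d_in)
weight = torch.randn(d_out, d_in)
y_full = x @ weight.T

# Row-parallel (LinearAllreduce-style): weight split along the input dim.
# Each rank holds a column-slice of the input and the matching weight shard;
# the partial outputs are summed, which AllReduce does across ranks.
w_row = torch.chunk(weight, tp_size, dim=1)   # shards of shape (d_out, d_in/tp)
x_row = torch.chunk(x, tp_size, dim=1)
partials = [xi @ wi.T for xi, wi in zip(x_row, w_row)]
y_row = torch.stack(partials).sum(dim=0)      # stands in for AllReduce(sum)

# Column-parallel (LinearLayer-style): weight split along the output dim.
# Each rank computes a disjoint slice of the output; no forward communication.
w_col = torch.chunk(weight, tp_size, dim=0)   # shards of shape (d_out/tp, d_in)
y_col = torch.cat([x @ wi.T for wi in w_col], dim=1)

assert torch.allclose(y_row, y_full, atol=1e-5)
assert torch.allclose(y_col, y_full, atol=1e-5)
```

The sum/concatenation distinction is exactly why row-parallel needs a forward AllReduce while column-parallel does not.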
When tp_overlap_comm is enabled, the column-parallel layers use AsyncColumnParallel which overlaps AllReduce communication with computation for improved throughput.
GatherReplacedLayerParams is a context manager that temporarily gathers full (un-partitioned) parameters from all TP ranks. This is essential for:
- Checkpointing: Saving complete model weights (see AutoTP_Model_Saving).
- Evaluation: Running inference on the full model without TP artifacts.
- Custom operations: Any operation that requires the full parameter tensor.
The context manager works by calling gather_params() on enter (which performs AllGather to reconstruct full tensors) and _tp_partition() on exit (which re-slices back to partitioned form).
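A minimal single-process sketch of this gather/re-partition pattern. The helper name below is hypothetical, and a concatenation stands in for the cross-rank AllGather that the real GatherReplacedLayerParams performs.

```python
import torch
from contextlib import contextmanager

tp_size, rank = 2, 0  # simulate rank 0 of a 2-way TP group

@contextmanager
def gather_replaced_params(module, shards, dim):
    # "gather_params": reconstruct the full tensor from all shards
    # (the real implementation AllGathers across TP ranks instead).
    module.weight.data = torch.cat(shards, dim=dim)
    try:
        yield module
    finally:
        # "_tp_partition": re-slice back to this rank's partition on exit
        module.weight.data = torch.chunk(module.weight.data, tp_size, dim=dim)[rank]

full = torch.randn(6, 8)
shards = list(torch.chunk(full, tp_size, dim=0))
layer = torch.nn.Linear(8, 6, bias=False)
layer.weight.data = shards[rank].clone()       # this rank's partition

with gather_replaced_params(layer, shards, dim=0) as m:
    assert m.weight.shape == (6, 8)            # full weight visible inside
assert layer.weight.shape == (3, 8)            # partitioned again on exit
```

Inside the `with` block the full weight can be checkpointed or evaluated; on exit the layer is back in its partitioned, memory-efficient form.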
Training mode differences from inference mode:
- Bias is added via input + bias (not in-place) to preserve autograd graph integrity.
- Weight partitioning uses even torch.chunk() rather than uneven shard sizes.
- TP parameters are configured with requires_grad=True and marked with tensor_model_parallel and ds_is_replaced_module attributes for the optimizer and checkpoint system.
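These conventions can be sketched with plain PyTorch; the attribute names follow the description above, and the shard-tagging shown is a simplified stand-in for what AutoTP does during layer replacement.

```python
import torch

tp_size = 4

# Even partitioning via torch.chunk along the output (row) dimension:
# all shards have identical shape, unlike uneven inference-time sharding.
full_weight = torch.randn(12, 8)
shards = torch.chunk(full_weight, tp_size, dim=0)
assert all(s.shape == (3, 8) for s in shards)

# Each rank wraps its shard as a trainable parameter and tags it so the
# optimizer and checkpoint system can recognize TP-partitioned tensors.
param = torch.nn.Parameter(shards[0].clone(), requires_grad=True)
param.tensor_model_parallel = True
param.ds_is_replaced_module = True

assert param.requires_grad and param.tensor_model_parallel
```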
Theoretical Basis
For row-parallel linear computation Y = XW where W is split row-wise (along the input dimension) as [W_1; W_2; ...; W_tp_size]:
- Each GPU i computes Y_i = X_i * W_i, where X_i is the matching column-slice of the input (yielding a partial sum of the output).
- An AllReduce sums the partial results: Y = sum(Y_1, Y_2, ..., Y_tp_size).
- This is implemented by LinearAllreduce, where the AllReduce is in the forward pass.
For column-parallel linear computation Y = XW where W is split column-wise (along the output dimension) as [W_1, W_2, ..., W_tp_size]:
- Each GPU i computes Y_i = X * W_i (a partition of the output along the output dimension).
- No communication is needed in forward; the next layer's row-parallel input is already distributed.
- AllReduce of gradients happens in the backward pass via the ColumnParallel autograd function.
This pairing creates a complete TP communication pattern per transformer block with exactly two AllReduce operations in the forward pass (one after attention output, one after MLP down), as described in the Megatron-LM paper.
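The pairing can be verified for the MLP half of the block on a single process. In the sketch below the final sum stands in for the MLP's one forward AllReduce (the attention output projection contributes the block's other AllReduce); because the activation is elementwise, each rank can apply it to its own column-parallel slice with no intermediate communication.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tp_size, batch, d_model, d_ff = 2, 4, 8, 16

x = torch.randn(batch, d_model)
w_up = torch.randn(d_ff, d_model)     # column-parallel: split along d_ff
w_down = torch.randn(d_model, d_ff)   # row-parallel: split along d_ff

# Reference: the full (unsharded) MLP from the Megatron-LM pairing.
ref = F.gelu(x @ w_up.T) @ w_down.T

# Per-rank computation: GeLU is elementwise, so each rank applies it to its
# own output slice; only ONE AllReduce (here a sum) is needed at the end.
up_shards = torch.chunk(w_up, tp_size, dim=0)
down_shards = torch.chunk(w_down, tp_size, dim=1)
partials = [F.gelu(x @ wu.T) @ wd.T for wu, wd in zip(up_shards, down_shards)]
y = torch.stack(partials).sum(dim=0)  # the block's single MLP AllReduce

assert torch.allclose(y, ref, atol=1e-4)
```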
Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).
Knowledge Sources
Relationships
Implementation:Deepspeedai_DeepSpeed_LinearAllreduce_Forward
Metadata
- Workflow: AutoTP_Training
- Type: Principle
- Last Updated: 2026-02-09 00:00 GMT