Principle:Deepspeedai DeepSpeed Tensor Parallel Training
Overview
Training with tensor-parallel layers that automatically handle AllReduce communication in the forward and backward passes, enabling distributed computation over partitioned weight matrices.
Detailed Description
After AutoTP replaces linear layers, the training loop proceeds normally via engine.backward() and engine.step(). The TP layers handle communication transparently -- the user's training code does not need to manage any distributed communication.
The two primary TP layer types behave differently during forward and backward:
LinearAllreduce (row-parallel):
- Forward: Computes output = input @ weight.T, then applies RowParallel, which performs an AllReduce across TP ranks to sum the partial results. In training mode, the AllReduce is performed with a gradient-aware autograd function (not in-place) to ensure correct backpropagation.
- Backward: Gradients flow back through the AllReduce operation automatically via PyTorch autograd.
LinearLayer (column-parallel):
- Forward: Applies ColumnParallel, which is an identity operation in forward (or optionally an AllGather in overlapped mode via AsyncColumnParallel), then computes output = input @ weight.T. The output is a partition of the full output.
- Backward: The ColumnParallel autograd function handles the AllReduce of gradients in the backward pass.
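The forward arithmetic of both layer types can be checked on a single process. The sketch below is illustrative only (not DeepSpeed code): torch.chunk shards stand in for per-rank partitions, and an explicit sum stands in for the AllReduce.

```python
import torch

torch.manual_seed(0)
tp_size, batch, d_in, d_out = 2, 4, 8, 6

# Full (un-partitioned) reference computation: output = input @ weight.T
x = torch.randn(batch, d_in)
weight = torch.randn(d_out, d_in)
y_full = x @ weight.T

# Row-parallel (LinearAllreduce-style): weight split along the input dim.
# Each rank holds a column-slice of the input and the matching weight shard;
# the partial outputs are summed, which AllReduce does across ranks.
w_row = torch.chunk(weight, tp_size, dim=1)   # shards of shape (d_out, d_in/tp)
x_row = torch.chunk(x, tp_size, dim=1)
partials = [xi @ wi.T for xi, wi in zip(x_row, w_row)]
y_row = torch.stack(partials).sum(dim=0)      # stands in for AllReduce(sum)

# Column-parallel (LinearLayer-style): weight split along the output dim.
# Each rank computes a disjoint slice of the output; no forward communication.
w_col = torch.chunk(weight, tp_size, dim=0)   # shards of shape (d_out/tp, d_in)
y_col = torch.cat([x @ wi.T for wi in w_col], dim=1)

assert torch.allclose(y_row, y_full, atol=1e-5)
assert torch.allclose(y_col, y_full, atol=1e-5)
```

The sum/concatenation distinction is exactly why row-parallel needs a forward AllReduce while column-parallel does not.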
When tp_overlap_comm is enabled, the column-parallel layers use AsyncColumnParallel which overlaps AllReduce communication with computation for improved throughput.
GatherReplacedLayerParams is a context manager that temporarily gathers full (un-partitioned) parameters from all TP ranks. This is essential for:
- Checkpointing: Saving complete model weights (see AutoTP_Model_Saving).
- Evaluation: Running inference on the full model without TP artifacts.
- Custom operations: Any operation that requires the full parameter tensor.
The context manager works by calling gather_params() on enter (which performs AllGather to reconstruct full tensors) and _tp_partition() on exit (which re-slices back to partitioned form).
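A minimal single-process sketch of this gather/re-partition pattern. The helper name below is hypothetical, and a concatenation stands in for the cross-rank AllGather that the real GatherReplacedLayerParams performs.

```python
import torch
from contextlib import contextmanager

tp_size, rank = 2, 0  # simulate rank 0 of a 2-way TP group

@contextmanager
def gather_replaced_params(module, shards, dim):
    # "gather_params": reconstruct the full tensor from all shards
    # (the real implementation AllGathers across TP ranks instead).
    module.weight.data = torch.cat(shards, dim=dim)
    try:
        yield module
    finally:
        # "_tp_partition": re-slice back to this rank's partition on exit
        module.weight.data = torch.chunk(module.weight.data, tp_size, dim=dim)[rank]

full = torch.randn(6, 8)
shards = list(torch.chunk(full, tp_size, dim=0))
layer = torch.nn.Linear(8, 6, bias=False)
layer.weight.data = shards[rank].clone()       # this rank's partition

with gather_replaced_params(layer, shards, dim=0) as m:
    assert m.weight.shape == (6, 8)            # full weight visible inside
assert layer.weight.shape == (3, 8)            # partitioned again on exit
```

Inside the `with` block the full weight can be checkpointed or evaluated; on exit the layer is back in its partitioned, memory-efficient form.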
Training mode differences from inference mode:
- Bias is added via input + bias (not in-place) to preserve autograd graph integrity.
- Weight partitioning uses even torch.chunk() rather than uneven shard sizes.
- TP parameters are configured with requires_grad=True and marked with tensor_model_parallel and ds_is_replaced_module attributes for the optimizer and checkpoint system.
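These conventions can be sketched with plain PyTorch; the attribute names follow the description above, and the shard-tagging shown is a simplified stand-in for what AutoTP does during layer replacement.

```python
import torch

tp_size = 4

# Even partitioning via torch.chunk along the output (row) dimension:
# all shards have identical shape, unlike uneven inference-time sharding.
full_weight = torch.randn(12, 8)
shards = torch.chunk(full_weight, tp_size, dim=0)
assert all(s.shape == (3, 8) for s in shards)

# Each rank wraps its shard as a trainable parameter and tags it so the
# optimizer and checkpoint system can recognize TP-partitioned tensors.
param = torch.nn.Parameter(shards[0].clone(), requires_grad=True)
param.tensor_model_parallel = True
param.ds_is_replaced_module = True

assert param.requires_grad and param.tensor_model_parallel
```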
Theoretical Basis
For row-parallel linear computation Y = XW where W is split row-wise (along the input dimension) as [W_1; W_2; ...; W_tp_size]:
- Each GPU i computes Y_i = X_i * W_i, where X_i is the matching column-slice of the input (yielding a partial sum of the output).
- An AllReduce sums the partial results: Y = sum(Y_1, Y_2, ..., Y_tp_size).
- This is implemented by LinearAllreduce, where the AllReduce is in the forward pass.
For column-parallel linear computation Y = XW where W is split column-wise (along the output dimension) as [W_1, W_2, ..., W_tp_size]:
- Each GPU i computes Y_i = X * W_i (a partition of the output along the output dimension).
- No communication is needed in forward; the next layer's row-parallel input is already distributed.
- AllReduce of gradients happens in the backward pass via the ColumnParallel autograd function.
This pairing creates a complete TP communication pattern per transformer block with exactly two AllReduce operations in the forward pass (one after attention output, one after MLP down), as described in the Megatron-LM paper.
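The pairing can be verified for the MLP half of the block on a single process. In the sketch below the final sum stands in for the MLP's one forward AllReduce (the attention output projection contributes the block's other AllReduce); because the activation is elementwise, each rank can apply it to its own column-parallel slice with no intermediate communication.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
tp_size, batch, d_model, d_ff = 2, 4, 8, 16

x = torch.randn(batch, d_model)
w_up = torch.randn(d_ff, d_model)     # column-parallel: split along d_ff
w_down = torch.randn(d_model, d_ff)   # row-parallel: split along d_ff

# Reference: the full (unsharded) MLP from the Megatron-LM pairing.
ref = F.gelu(x @ w_up.T) @ w_down.T

# Per-rank computation: GeLU is elementwise, so each rank applies it to its
# own output slice; only ONE AllReduce (here a sum) is needed at the end.
up_shards = torch.chunk(w_up, tp_size, dim=0)
down_shards = torch.chunk(w_down, tp_size, dim=1)
partials = [F.gelu(x @ wu.T) @ wd.T for wu, wd in zip(up_shards, down_shards)]
y = torch.stack(partials).sum(dim=0)  # the block's single MLP AllReduce

assert torch.allclose(y, ref, atol=1e-4)
```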
Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).
Knowledge Sources
Relationships
Implementation:Deepspeedai_DeepSpeed_LinearAllreduce_Forward
Metadata
- Workflow: AutoTP_Training
- Type: Principle
- Last Updated: 2026-02-09 00:00 GMT