Principle: Deepspeedai DeepSpeed AutoTP Engine Init
Overview
Automatically detecting and replacing transformer linear layers with tensor-parallel variants during DeepSpeed engine initialization.
Detailed Description
During deepspeed.initialize(), when autotp_size > 0 in the config, the AutoTP system performs the following orchestrated sequence of operations:
- Set the global AutoTP mode to training via set_autotp_mode(training=True), which sets DEEPSPEED_AUTOTP_MODE = AUTOTP_MODE.TRAINING. This flag affects how TP layers behave (e.g., avoiding in-place operations so autograd records correctly in training mode).
- Detect transformer layers using either built-in presets (via AutoTPPresets) or custom patterns (via AutoTPConfig with TPLayerSpec rules). If no custom config or preset is provided, the legacy tp_parser() heuristic analyzes the model graph to identify which linear layers are row-parallel vs column-parallel.
- Replace standard nn.Linear layers with TP variants: LinearAllreduce for row-parallel layers (attention output projection, MLP down projection), which perform an AllReduce after their forward pass to aggregate partial sums; LinearLayer for column-parallel layers (QKV projections, MLP gate/up projections), which need no forward communication, with the corresponding identity/AllGather handling deferred to the backward pass.
- Specialized variants exist for fused QKV (fused_LinearLayer), gate-up packed layers (GateUpPack_LinearLayer), LM heads (LmHeadLinearAllreduce), and sub-parameter partitioning (SubParamLinearLayer, SubParamLinearAllreduce).
- Establish TP communication groups using either the provided mpu, a directly provided tp_group, or auto-created groups via _init_tp_mesh_device().
- Partition weights immediately upon layer replacement. Each TP variant's __init__ calls _tp_partition() to slice the weight tensor and move the local shard to the device.
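The per-rank slicing can be illustrated with a small NumPy sketch. The tp_partition helper below is a hypothetical stand-in for DeepSpeed's _tp_partition(), assuming the standard Megatron layout: column-parallel layers shard the output dimension, row-parallel layers shard the input dimension.

```python
import numpy as np

def tp_partition(weight, tp_rank, tp_size, row_parallel):
    """Slice a full weight matrix into this rank's local shard.

    With weights stored as (out_features, in_features), column-parallel
    layers shard axis 0 (output dim) and row-parallel layers shard
    axis 1 (input dim).
    """
    axis = 1 if row_parallel else 0
    shards = np.array_split(weight, tp_size, axis=axis)
    return shards[tp_rank]

# An 8x4 weight split across 2 ranks:
w = np.arange(32, dtype=np.float32).reshape(8, 4)
col_shard = tp_partition(w, tp_rank=0, tp_size=2, row_parallel=False)
row_shard = tp_partition(w, tp_rank=0, tp_size=2, row_parallel=True)
assert col_shard.shape == (4, 4)   # output dim halved
assert row_shard.shape == (8, 2)   # input dim halved
```

In the real engine the shard would additionally be moved to the rank's device; that step is omitted here.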
The AutoTP class orchestrates this detection and replacement process. It walks the model tree via _replace_module(), checking each child module against the configured policies or partition config patterns, and replaces matching layers in-place using setattr().
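The walk-and-replace step can be sketched with simplified stand-in classes. Module, Linear, and the TP variants here are illustrative mock-ups (not DeepSpeed's actual classes), and the row-parallel name set is an assumption chosen for the example:

```python
class Module:
    """Minimal stand-in for torch.nn.Module, enough to show the walk."""
    def __init__(self, **children):
        self._children = dict(children)
        for name, child in children.items():
            setattr(self, name, child)
    def named_children(self):
        return self._children.items()

class Linear(Module):
    pass

class LinearLayer(Module):        # column-parallel replacement
    def __init__(self, orig): super().__init__()

class LinearAllreduce(Module):    # row-parallel replacement
    def __init__(self, orig): super().__init__()

# Assumed row-parallel layer names (attention output, MLP down).
ROW_PARALLEL = {"o_proj", "down_proj"}

def replace_module(module):
    """Recursively swap Linear children for TP variants in place."""
    for name, child in list(module.named_children()):
        if isinstance(child, Linear):
            tp = LinearAllreduce if name in ROW_PARALLEL else LinearLayer
            new = tp(child)
            setattr(module, name, new)   # in-place replacement
            module._children[name] = new
        else:
            replace_module(child)

block = Module(
    self_attn=Module(q_proj=Linear(), o_proj=Linear()),
    mlp=Module(up_proj=Linear(), down_proj=Linear()),
)
replace_module(block)
assert isinstance(block.self_attn.q_proj, LinearLayer)
assert isinstance(block.self_attn.o_proj, LinearAllreduce)
```

The real _replace_module() consults the configured policies or partition config instead of a fixed name set, but the recursive setattr() pattern is the same.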
Theoretical Basis
Automatic tensor parallelism detection analyzes the model graph to identify which linear layers should be column-parallel vs row-parallel. The standard Megatron-LM partitioning pattern for a transformer block is:
- Attention block: QKV projections are column-parallel (split output heads across GPUs), followed by the output projection as row-parallel (AllReduce after).
- MLP block: Gate and up projections are column-parallel, followed by the down projection as row-parallel (AllReduce after).
This creates a pattern where each transformer block has exactly two AllReduce synchronization points in the forward pass (one after attention output, one after MLP down). Column-parallel layers naturally distribute activations without communication; row-parallel layers require AllReduce to sum partial results.
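This equivalence, one AllReduce per column-parallel/row-parallel pair, can be verified numerically. The shapes and two-rank split below are illustrative, and the nonlinearity is omitted for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))          # activations: batch x d_model
w_up = rng.standard_normal((8, 32))      # column-parallel: split output dim
w_down = rng.standard_normal((32, 8))    # row-parallel: split input dim

# Dense reference: full two-layer projection on one device.
ref = x @ w_up @ w_down

# Two "ranks": each holds half of w_up's output columns and the
# matching half of w_down's input rows.
partials = []
for r in range(2):
    up_shard = w_up[:, r * 16:(r + 1) * 16]    # column-parallel shard
    down_shard = w_down[r * 16:(r + 1) * 16]   # row-parallel shard
    # No communication needed until after the row-parallel matmul.
    partials.append(x @ up_shard @ down_shard)

# The single AllReduce after the row-parallel layer sums the partials.
out = partials[0] + partials[1]
assert np.allclose(out, ref)
```

Because each rank's column shard of w_up feeds exactly its row shard of w_down, the intermediate activation never needs to be gathered; only the final partial sums are reduced.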
The _replace_module() method supports two routing paths:
- Pattern-based routing (when partition_config is provided): uses regex pattern matching on parameter names to determine the TP strategy. This is more flexible and supports custom architectures.
- Type-based routing (legacy path): uses a linear_policies dictionary mapping module types to replacement functions, combined with the tp_parser() heuristic that identifies row-parallel layers by their position relative to LayerNorm boundaries.
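Pattern-based routing can be sketched as a first-match lookup over regexes. The pattern table, strategy labels, and parameter names below are illustrative assumptions, not DeepSpeed's actual partition_config format:

```python
import re

# Hypothetical pattern table: regex on parameter name -> TP strategy.
PARTITION_PATTERNS = [
    (re.compile(r"\.(q_proj|k_proj|v_proj|gate_proj|up_proj)\.weight$"),
     "column"),
    (re.compile(r"\.(o_proj|down_proj)\.weight$"),
     "row"),
]

def route(param_name):
    """Return the TP strategy for a parameter name, or None to skip it."""
    for pattern, strategy in PARTITION_PATTERNS:
        if pattern.search(param_name):
            return strategy
    return None

assert route("model.layers.0.self_attn.q_proj.weight") == "column"
assert route("model.layers.0.mlp.down_proj.weight") == "row"
assert route("model.embed_tokens.weight") is None   # left unpartitioned
```

Matching on names rather than module types is what lets this path handle custom architectures: any layer whose parameter name fits a pattern gets the corresponding strategy, regardless of its Python class.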
Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).
Knowledge Sources
Relationships
Implementation:Deepspeedai_DeepSpeed_AutoTP_Replace
Metadata
- Workflow: AutoTP_Training
- Type: Principle
- Last Updated: 2026-02-09 00:00 GMT