Principle: Deepspeedai DeepSpeed AutoTP Engine Init
Overview
Automatically detecting and replacing transformer linear layers with tensor-parallel variants during DeepSpeed engine initialization.
Detailed Description
During deepspeed.initialize(), when autotp_size > 0 in the config, the AutoTP system performs the following orchestrated sequence of operations:
- Set the global AutoTP mode to training via set_autotp_mode(training=True), which sets DEEPSPEED_AUTOTP_MODE = AUTOTP_MODE.TRAINING. This flag affects how TP layers behave (e.g., avoiding in-place operations so autograd records correctly in training mode).
- Detect transformer layers using either built-in presets (via AutoTPPresets) or custom patterns (via AutoTPConfig with TPLayerSpec rules). If no custom config or preset is provided, the legacy tp_parser() heuristic analyzes the model graph to identify which linear layers are row-parallel vs column-parallel.
- Replace standard nn.Linear layers with TP variants: LinearAllreduce for row-parallel layers (attention output projection, MLP down projection), which perform an AllReduce after their forward pass to aggregate partial sums; LinearLayer for column-parallel layers (QKV projections, MLP gate/up projections), which need no forward communication, with the corresponding identity/AllGather handling deferred to the backward pass.
- Specialized variants exist for fused QKV (fused_LinearLayer), gate-up packed layers (GateUpPack_LinearLayer), LM heads (LmHeadLinearAllreduce), and sub-parameter partitioning (SubParamLinearLayer, SubParamLinearAllreduce).
- Establish TP communication groups using either the provided mpu, a directly provided tp_group, or auto-created groups via _init_tp_mesh_device().
- Partition weights immediately upon layer replacement. Each TP variant's __init__ calls _tp_partition() to slice the weight tensor and move the local shard to the device.
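The per-rank slicing can be illustrated with a small NumPy sketch. The tp_partition helper below is a hypothetical stand-in for DeepSpeed's _tp_partition(), assuming the standard Megatron layout: column-parallel layers shard the output dimension, row-parallel layers shard the input dimension.

```python
import numpy as np

def tp_partition(weight, tp_rank, tp_size, row_parallel):
    """Slice a full weight matrix into this rank's local shard.

    With weights stored as (out_features, in_features), column-parallel
    layers shard axis 0 (output dim) and row-parallel layers shard
    axis 1 (input dim).
    """
    axis = 1 if row_parallel else 0
    shards = np.array_split(weight, tp_size, axis=axis)
    return shards[tp_rank]

# An 8x4 weight split across 2 ranks:
w = np.arange(32, dtype=np.float32).reshape(8, 4)
col_shard = tp_partition(w, tp_rank=0, tp_size=2, row_parallel=False)
row_shard = tp_partition(w, tp_rank=0, tp_size=2, row_parallel=True)
assert col_shard.shape == (4, 4)   # output dim halved
assert row_shard.shape == (8, 2)   # input dim halved
```

In the real engine the shard would additionally be moved to the rank's device; that step is omitted here.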
The AutoTP class orchestrates this detection and replacement process. It walks the model tree via _replace_module(), checking each child module against the configured policies or partition config patterns, and replaces matching layers in-place using setattr().
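The walk-and-replace step can be sketched with simplified stand-in classes. Module, Linear, and the TP variants here are illustrative mock-ups (not DeepSpeed's actual classes), and the row-parallel name set is an assumption chosen for the example:

```python
class Module:
    """Minimal stand-in for torch.nn.Module, enough to show the walk."""
    def __init__(self, **children):
        self._children = dict(children)
        for name, child in children.items():
            setattr(self, name, child)
    def named_children(self):
        return self._children.items()

class Linear(Module):
    pass

class LinearLayer(Module):        # column-parallel replacement
    def __init__(self, orig): super().__init__()

class LinearAllreduce(Module):    # row-parallel replacement
    def __init__(self, orig): super().__init__()

# Assumed row-parallel layer names (attention output, MLP down).
ROW_PARALLEL = {"o_proj", "down_proj"}

def replace_module(module):
    """Recursively swap Linear children for TP variants in place."""
    for name, child in list(module.named_children()):
        if isinstance(child, Linear):
            tp = LinearAllreduce if name in ROW_PARALLEL else LinearLayer
            new = tp(child)
            setattr(module, name, new)   # in-place replacement
            module._children[name] = new
        else:
            replace_module(child)

block = Module(
    self_attn=Module(q_proj=Linear(), o_proj=Linear()),
    mlp=Module(up_proj=Linear(), down_proj=Linear()),
)
replace_module(block)
assert isinstance(block.self_attn.q_proj, LinearLayer)
assert isinstance(block.self_attn.o_proj, LinearAllreduce)
```

The real _replace_module() consults the configured policies or partition config instead of a fixed name set, but the recursive setattr() pattern is the same.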
Theoretical Basis
Automatic tensor parallelism detection analyzes the model graph to identify which linear layers should be column-parallel vs row-parallel. The standard Megatron-LM partitioning pattern for a transformer block is:
- Attention block: QKV projections are column-parallel (split output heads across GPUs), followed by the output projection as row-parallel (AllReduce after).
- MLP block: Gate and up projections are column-parallel, followed by the down projection as row-parallel (AllReduce after).
This creates a pattern where each transformer block has exactly two AllReduce synchronization points in the forward pass (one after attention output, one after MLP down). Column-parallel layers naturally distribute activations without communication; row-parallel layers require AllReduce to sum partial results.
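This equivalence, one AllReduce per column-parallel/row-parallel pair, can be verified numerically. The shapes and two-rank split below are illustrative, and the nonlinearity is omitted for clarity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))          # activations: batch x d_model
w_up = rng.standard_normal((8, 32))      # column-parallel: split output dim
w_down = rng.standard_normal((32, 8))    # row-parallel: split input dim

# Dense reference: full two-layer projection on one device.
ref = x @ w_up @ w_down

# Two "ranks": each holds half of w_up's output columns and the
# matching half of w_down's input rows.
partials = []
for r in range(2):
    up_shard = w_up[:, r * 16:(r + 1) * 16]    # column-parallel shard
    down_shard = w_down[r * 16:(r + 1) * 16]   # row-parallel shard
    # No communication needed until after the row-parallel matmul.
    partials.append(x @ up_shard @ down_shard)

# The single AllReduce after the row-parallel layer sums the partials.
out = partials[0] + partials[1]
assert np.allclose(out, ref)
```

Because each rank's column shard of w_up feeds exactly its row shard of w_down, the intermediate activation never needs to be gathered; only the final partial sums are reduced.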
The _replace_module() method supports two routing paths:
- Pattern-based routing (when partition_config is provided): uses regex pattern matching on parameter names to determine the TP strategy. This is more flexible and supports custom architectures.
- Type-based routing (legacy path): uses a linear_policies dictionary mapping module types to replacement functions, combined with the tp_parser() heuristic that identifies row-parallel layers by their position relative to LayerNorm boundaries.
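Pattern-based routing can be sketched as a first-match lookup over regexes. The pattern table, strategy labels, and parameter names below are illustrative assumptions, not DeepSpeed's actual partition_config format:

```python
import re

# Hypothetical pattern table: regex on parameter name -> TP strategy.
PARTITION_PATTERNS = [
    (re.compile(r"\.(q_proj|k_proj|v_proj|gate_proj|up_proj)\.weight$"),
     "column"),
    (re.compile(r"\.(o_proj|down_proj)\.weight$"),
     "row"),
]

def route(param_name):
    """Return the TP strategy for a parameter name, or None to skip it."""
    for pattern, strategy in PARTITION_PATTERNS:
        if pattern.search(param_name):
            return strategy
    return None

assert route("model.layers.0.self_attn.q_proj.weight") == "column"
assert route("model.layers.0.mlp.down_proj.weight") == "row"
assert route("model.embed_tokens.weight") is None   # left unpartitioned
```

Matching on names rather than module types is what lets this path handle custom architectures: any layer whose parameter name fits a pattern gets the corresponding strategy, regardless of its Python class.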
Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).
Knowledge Sources
Relationships
Implementation:Deepspeedai_DeepSpeed_AutoTP_Replace
Metadata
- Workflow: AutoTP_Training
- Type: Principle
- Last Updated: 2026-02-09 00:00 GMT