
Principle:Deepspeedai DeepSpeed AutoTP Engine Init

From Leeroopedia


Overview

Automatically detecting and replacing transformer linear layers with tensor-parallel variants during DeepSpeed engine initialization.

Detailed Description

During deepspeed.initialize(), when autotp_size > 0 in the config, the AutoTP system performs the following orchestrated sequence of operations:

  1. Set the global AutoTP mode to training via set_autotp_mode(training=True), which sets DEEPSPEED_AUTOTP_MODE = AUTOTP_MODE.TRAINING. This flag affects how TP layers behave (e.g., avoiding in-place operations for correct autograd in training mode).
  2. Detect transformer layers using either built-in presets (via AutoTPPresets) or custom patterns (via AutoTPConfig with TPLayerSpec rules). If no custom config or preset is provided, the legacy tp_parser() heuristic analyzes the model graph to identify which linear layers are row-parallel vs column-parallel.
  3. Replace standard nn.Linear layers with TP variants:
    • LinearAllreduce for row-parallel layers (attention output projection, MLP down projection). These perform AllReduce after their forward pass to aggregate partial sums.
    • LinearLayer for column-parallel layers (QKV projections, MLP gate/up projections). These need no communication in the forward pass, since each rank holds a slice of the output dimension and the input is replicated; their backward pass aggregates the input gradients across ranks.
    • Specialized variants exist for fused QKV (fused_LinearLayer), gate-up packed layers (GateUpPack_LinearLayer), LM heads (LmHeadLinearAllreduce), and sub-parameter partitioning (SubParamLinearLayer, SubParamLinearAllreduce).
  4. Establish TP communication groups using either the provided mpu, a directly provided tp_group, or auto-created groups via _init_tp_mesh_device().
  5. Partition weights immediately upon layer replacement. Each TP variant's __init__ calls _tp_partition() to slice the weight tensor and move the local shard to the device.
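The five-step sequence above can be sketched as a minimal single-process simulation. Only `set_autotp_mode`, `AUTOTP_MODE`, `DEEPSPEED_AUTOTP_MODE`, `LinearAllreduce`, and `LinearLayer` are names from the text; everything else (`autotp_init`, the `ROW_PARALLEL` table, the dict-based "model") is a hypothetical stand-in, not the DeepSpeed API.

```python
from enum import Enum

class AUTOTP_MODE(Enum):
    INFERENCE = "inference"
    TRAINING = "training"

DEEPSPEED_AUTOTP_MODE = AUTOTP_MODE.INFERENCE

def set_autotp_mode(training: bool) -> None:
    # Step 1: flip the global flag so TP layers use training-safe behavior
    # (e.g. avoiding in-place ops that would break autograd).
    global DEEPSPEED_AUTOTP_MODE
    DEEPSPEED_AUTOTP_MODE = AUTOTP_MODE.TRAINING if training else AUTOTP_MODE.INFERENCE

class Linear:                      # stand-in for nn.Linear
    def __init__(self, name): self.name = name

class LinearAllreduce(Linear):     # row-parallel replacement (AllReduce after forward)
    pass

class LinearLayer(Linear):         # column-parallel replacement
    pass

# Step 2 (stand-in): pretend detection already classified these as row-parallel.
ROW_PARALLEL = {"attn.out_proj", "mlp.down_proj"}

def autotp_init(model: dict, autotp_size: int) -> dict:
    if autotp_size <= 0:           # AutoTP only engages when autotp_size > 0
        return model
    set_autotp_mode(training=True)                  # step 1
    for name in model:                              # steps 2-3: detect and replace
        cls = LinearAllreduce if name in ROW_PARALLEL else LinearLayer
        model[name] = cls(name)    # step 5 (weight sharding) would happen in __init__
    return model

model = {n: Linear(n) for n in
         ["attn.qkv", "attn.out_proj", "mlp.up_proj", "mlp.down_proj"]}
model = autotp_init(model, autotp_size=2)
```

Steps 4 and 5 (group creation and `_tp_partition()` sharding) are elided here; in the real engine they run inside each TP variant's construction.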

The AutoTP class orchestrates this detection and replacement process. It walks the model tree via _replace_module(), checking each child module against the configured policies or partition config patterns, and replaces matching layers in-place using setattr().
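The walk-and-replace pattern can be illustrated with a small sketch. The recursion-plus-`setattr()` shape mirrors the description of `_replace_module()`, but the `Module` tree and policy table here are simplified stand-ins, not DeepSpeed's actual classes.

```python
class Module:
    """Tiny stand-in for nn.Module: children are Module-typed attributes."""
    def named_children(self):
        return [(k, v) for k, v in vars(self).items() if isinstance(v, Module)]

class Linear(Module):
    pass

class TPLinear(Module):            # stand-in for a tensor-parallel replacement
    def __init__(self, replaced):
        self.replaced = replaced

def _replace_module(module, policies):
    # Walk the model tree; swap matching children in place on their parent.
    for name, child in module.named_children():
        repl = policies.get(type(child))
        if repl is not None:
            setattr(module, name, repl(child))   # in-place replacement
        else:
            _replace_module(child, policies)     # recurse into non-matching children
    return module

class Block(Module):
    def __init__(self):
        self.proj = Linear()

class Model(Module):
    def __init__(self):
        self.block = Block()

m = _replace_module(Model(), {Linear: TPLinear})
```

Replacing via `setattr()` on the parent means the rest of the model keeps its structure; only the matched leaf layers change type.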

Theoretical Basis

Automatic tensor parallelism detection analyzes the model graph to identify which linear layers should be column-parallel vs row-parallel. The standard Megatron-LM partitioning pattern for a transformer block is:

  • Attention block: QKV projections are column-parallel (split output heads across GPUs), followed by the output projection as row-parallel (AllReduce after).
  • MLP block: Gate and up projections are column-parallel, followed by the down projection as row-parallel (AllReduce after).

This creates a pattern where each transformer block has exactly two AllReduce synchronization points in the forward pass (one after attention output, one after MLP down). Column-parallel layers naturally distribute activations without communication; row-parallel layers require AllReduce to sum partial results.
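The "one AllReduce per column-then-row pair" claim can be checked numerically. Below, a two-rank split of a column-parallel matmul followed by a row-parallel matmul reproduces the unpartitioned result with a single summation at the end; this is a plain-Python simulation, not real collective communication.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[1.0, 2.0]]                     # activations: 1 x 2
W1 = [[1.0, 2.0, 3.0, 4.0],          # column-parallel weight: 2 x 4
      [5.0, 6.0, 7.0, 8.0]]
W2 = [[1.0, 0.0],                    # row-parallel weight: 4 x 2
      [0.0, 1.0],
      [1.0, 1.0],
      [2.0, 1.0]]

def rank_forward(r):
    # Rank r holds a column slice of W1 and the matching row slice of W2.
    w1_shard = [row[2 * r: 2 * r + 2] for row in W1]   # split output dim
    w2_shard = W2[2 * r: 2 * r + 2]                    # split input dim
    h = matmul(X, w1_shard)          # column-parallel: no communication needed
    return matmul(h, w2_shard)       # row-parallel: partial sum, needs AllReduce

partials = [rank_forward(r) for r in range(2)]
# Simulated AllReduce: elementwise sum of the two partial results.
tp_out = [[a + b for a, b in zip(p0, p1)] for p0, p1 in zip(*partials)]

full = matmul(matmul(X, W1), W2)     # unpartitioned reference
```

Because `tp_out` equals `full`, the intermediate activation `h` never needed to be gathered; only the final partial sums are reduced.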

The _replace_module() method supports two routing paths:

  • Pattern-based routing (when partition_config is provided): Uses regex pattern matching on parameter names to determine the TP strategy. This is more flexible and supports custom architectures.
  • Type-based routing (legacy path): Uses linear_policies dictionary mapping module types to replacement functions, combined with the tp_parser() heuristic that identifies row-parallel layers by their position relative to LayerNorm boundaries.
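The pattern-based path can be sketched with plain regex matching on parameter names. The rule table below is illustrative only; it is not DeepSpeed's actual `partition_config` schema, and the projection names assume a LLaMA-style naming convention.

```python
import re

# Hypothetical rules: (regex on parameter name, TP strategy).
PARTITION_RULES = [
    (r"\.(q_proj|k_proj|v_proj|gate_proj|up_proj)\.weight$", "column"),
    (r"\.(o_proj|down_proj)\.weight$", "row"),
]

def route(param_name):
    # First matching pattern wins; unmatched layers stay unpartitioned.
    for pattern, strategy in PARTITION_RULES:
        if re.search(pattern, param_name):
            return strategy
    return None
```

Because routing keys on names rather than module types, custom architectures only need new patterns, not new replacement policies.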

Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).

Knowledge Sources

Relationships

Implementation:Deepspeedai_DeepSpeed_AutoTP_Replace

Metadata

  • Workflow: AutoTP_Training
  • Type: Principle
  • Last Updated: 2026-02-09 00:00 GMT
