Principle:Deepspeedai DeepSpeed AutoTP Configuration

Overview

AutoTP (automatic tensor parallelism) is configured through a combination of a Pydantic training config and dataclass-based layer pattern specifications; the latter also allow custom model architectures to be described.

Detailed Description

AutoTP configuration has three components:

1. TPTrainingConfig -- a Pydantic model (inheriting from DeepSpeedConfigModel) that specifies the top-level TP training settings. This is read from the tensor_parallel section of the DeepSpeed JSON config. Key fields include:

autotp_size (int, default 0): The tensor parallelism degree. When 0, AutoTP is disabled.
dtype (torch.dtype, default torch.float16): The desired model data type for TP operations.
tp_overlap_comm (bool, default False): Whether to overlap AllReduce communication with computation.
partition_config (Optional[Dict]): A dictionary specifying custom layer partitioning rules via TPLayerSpec.
preset_model (Optional[str]): A string key to select a built-in preset (e.g., "llama", "bloom", "mixtral").
tensor_parallel (nested TPConfig): Sub-config with tp_size, mpu, and tp_group.
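
As a rough illustration of the fields above, the following sketch builds a tensor_parallel section as a Python dict and serializes it to JSON. The field names come from the list above; the values (a TP degree of 4, the "llama" preset) are made up for the example, and autotp_enabled is a hypothetical helper, not a DeepSpeed function. The dtype field is left out because this page does not say how it is written in JSON.

```python
import json

# Hypothetical values; only the key names are taken from the field list above.
ds_config = {
    "tensor_parallel": {
        "autotp_size": 4,          # TP degree; 0 (the default) disables AutoTP
        "tp_overlap_comm": False,  # do not overlap AllReduce with computation
        "preset_model": "llama",   # select a built-in preset by key
    }
}


def autotp_enabled(config: dict) -> bool:
    """Hypothetical helper: AutoTP counts as enabled when autotp_size is non-zero."""
    return config.get("tensor_parallel", {}).get("autotp_size", 0) > 0


config_json = json.dumps(ds_config, indent=2)
print(config_json)
print("AutoTP enabled:", autotp_enabled(ds_config))
```

A config without a tensor_parallel section, or with autotp_size left at 0, is treated as AutoTP-disabled by this helper, mirroring the default described above.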

2. AutoTPConfig -- a dataclass that defines layer-level sharding patterns using TPLayerSpec rules. Each TPLayerSpec specifies:

patterns: A list of regex patterns matching parameter names (e.g., .*\.o_proj\.weight$).
partition_type: One of ROW, COLUMN, or SKIP.
shape: Optional tuple for reshaping packed/fused weights before partitioning (e.g., (2, -1) for gate_up packed layers).
partition_dim: Optional override for the partition dimension.
model_types: Optional list restricting the spec to specific model architectures.
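
To make the rule structure concrete, here is a minimal, self-contained sketch of a spec with the fields listed above and a regex lookup over parameter names. LayerSpecSketch, PartitionType, and find_spec are illustrative stand-ins written for this page, not the DeepSpeed classes, and the first-match-wins lookup order is an assumption.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class PartitionType(Enum):
    ROW = "row"
    COLUMN = "column"
    SKIP = "skip"


@dataclass
class LayerSpecSketch:
    """Illustrative stand-in mirroring the TPLayerSpec fields described above."""
    patterns: List[str]
    partition_type: PartitionType
    shape: Optional[Tuple[int, ...]] = None
    partition_dim: Optional[int] = None
    model_types: Optional[List[str]] = None

    def matches(self, param_name: str, model_type: Optional[str] = None) -> bool:
        # A spec restricted to certain model types ignores all other models.
        if self.model_types is not None and model_type not in self.model_types:
            return False
        return any(re.match(p, param_name) for p in self.patterns)


def find_spec(specs, param_name, model_type=None):
    """Return the first matching spec, or None (first-match order is assumed)."""
    for spec in specs:
        if spec.matches(param_name, model_type):
            return spec
    return None


specs = [
    LayerSpecSketch([r".*\.o_proj\.weight$", r".*\.down_proj\.weight$"], PartitionType.ROW),
    LayerSpecSketch([r".*\.gate_up_proj\.weight$"], PartitionType.COLUMN, shape=(2, -1)),
    LayerSpecSketch([r".*\.[qkv]_proj\.weight$"], PartitionType.COLUMN),
]

hit = find_spec(specs, "model.layers.0.self_attn.o_proj.weight")
print(hit.partition_type)
```

The parameter names used here follow a LLaMA-style naming scheme chosen for illustration; a name that no pattern matches simply returns None and would be left unpartitioned in this sketch.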

3. AutoTPPresets -- a collection of built-in presets for common architectures:

llama: Separate Q, K, V projections with standard row/column mapping.
bloom: Fused query_key_value with interleaved heads.
chatglm: GLM-style fused QKV with shape=(3, -1).
mixtral: Mixture-of-Experts with per-expert partitioning and MoE gate skipping.
deepseek_v2: Multi-head Latent Attention (MLA) with skip rules for low-rank projections.
qwen2: Standard LLaMA-like pattern.
phi3: Phi-3 architecture pattern.

When both a preset and custom partition_config are specified, they can be merged if use_default_specs is true (custom specs take priority, defaults fill gaps). If use_default_specs is false, only the custom specs are used.
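
The merge behaviour can be sketched as follows, under the assumption that specs form an ordered list and the first matching pattern wins, so placing custom specs ahead of preset specs gives them priority. resolve_specs and lookup are hypothetical helpers, and each spec is simplified to a (pattern, partition type) pair.

```python
import re
from typing import List, Optional, Tuple

Spec = Tuple[str, str]  # simplified stand-in: (regex pattern, partition type)


def resolve_specs(custom: List[Spec], preset: List[Spec], use_default_specs: bool) -> List[Spec]:
    """Custom specs come first; preset specs are appended only when requested."""
    if not use_default_specs:
        return list(custom)
    return list(custom) + list(preset)


def lookup(specs: List[Spec], name: str) -> Optional[str]:
    """First matching pattern wins (an assumption of this sketch)."""
    for pattern, partition_type in specs:
        if re.match(pattern, name):
            return partition_type
    return None


preset = [(r".*\.o_proj\.weight$", "ROW"), (r".*\.q_proj\.weight$", "COLUMN")]
custom = [(r".*\.o_proj\.weight$", "SKIP")]  # hypothetical override

merged = resolve_specs(custom, preset, use_default_specs=True)
custom_only = resolve_specs(custom, preset, use_default_specs=False)

print(lookup(merged, "layers.0.o_proj.weight"))       # custom rule takes priority
print(lookup(merged, "layers.0.q_proj.weight"))       # preset fills the gap
print(lookup(custom_only, "layers.0.q_proj.weight"))  # no rule without defaults
```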

Theoretical Basis

Tensor parallelism partitions individual weight matrices across GPUs. For a linear layer computing Y = XW with weight W of shape [M, N] (M input features, N output features):

Column parallelism splits along the output dimension N: each GPU holds W_i of shape [M, N/tp_size]. Each GPU produces a slice of the full output, which is exactly the input slice the next row-parallel layer expects, so no communication is needed between the two layers.
Row parallelism splits along the input dimension M: each GPU holds W_i of shape [M/tp_size, N]. Each GPU computes a partial sum, then an AllReduce aggregates the results.
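
The two splits can be checked numerically with a small pure-Python sketch. It simulates tp_size = 2 "GPUs" as list entries in a single process; the matrices are arbitrary illustrative values, and the concatenation and summation steps stand in for the gather and AllReduce that real ranks would perform.

```python
def matmul(a, b):
    """Plain nested-list matrix product: a is [R, K], b is [K, C]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(w, tp_size):
    """Split a matrix into tp_size shards along its last (column) dimension."""
    step = len(w[0]) // tp_size
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(tp_size)]


def split_rows(w, tp_size):
    """Split a matrix into tp_size shards along its first (row) dimension."""
    step = len(w) // tp_size
    return [w[i * step:(i + 1) * step] for i in range(tp_size)]


tp_size = 2
x = [[1, 2, 3, 4]]                      # one input row, M = 4
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1],
     [1, 1, 1, 0]]                      # W of shape [M, N] = [4, 4]
full = matmul(x, w)

# Column parallelism: each shard is [M, N/tp_size]; outputs are concatenated.
col_parts = [matmul(x, w_i) for w_i in split_columns(w, tp_size)]
col_out = [sum((part[r] for part in col_parts), []) for r in range(len(x))]

# Row parallelism: each shard is [M/tp_size, N] and sees the matching input slice;
# the partial results are summed, which is what the AllReduce does.
x_parts = split_columns(x, tp_size)
row_parts = [matmul(x_i, w_i) for x_i, w_i in zip(x_parts, split_rows(w, tp_size))]
row_out = [[sum(vals) for vals in zip(*rows)] for rows in zip(*row_parts)]

print(full, col_out, row_out)
```

All three printed results are identical, which is the property the two partitioning schemes rely on.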

The configuration maps each linear layer in a transformer block to its parallelism strategy. In a standard transformer:

Column-parallel: QKV projections, MLP gate/up projections (split output dimension).
Row-parallel: Attention output projection, MLP down projection (split input dimension, AllReduce output).
Skip: MoE gates, low-rank projections (not partitioned).
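
The reason column-parallel and row-parallel layers are paired this way can be seen in a toy MLP: an up projection sharded by columns feeds a down projection sharded by rows, and a single summation at the end reproduces the unsharded result. The sketch below simulates the ranks in one process with made-up integer weights and a ReLU activation; it is an illustration of the Megatron-style pairing, not DeepSpeed code.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU; elementwise ops stay local under a column split."""
    return [[max(0, v) for v in row] for row in m]


tp_size = 2
x = [[1, -2]]                           # hidden size 2
w_up = [[1, -1, 2, 0],
        [0, 1, -1, 3]]                  # [2, 4], column-parallel
w_down = [[1, 0],
          [2, 1],
          [0, 1],
          [1, 1]]                       # [4, 2], row-parallel

reference = matmul(relu(matmul(x, w_up)), w_down)

shard = len(w_down) // tp_size
partials = []
for rank in range(tp_size):
    lo, hi = rank * shard, (rank + 1) * shard
    w_up_i = [row[lo:hi] for row in w_up]   # column shard of the up projection
    w_down_i = w_down[lo:hi]                # matching row shard of the down projection
    h_i = relu(matmul(x, w_up_i))           # no communication needed here
    partials.append(matmul(h_i, w_down_i))  # partial sum held by this rank

# One summation across ranks plays the role of the AllReduce.
out = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
print(reference, out)
```

Because the intermediate activation never has to be gathered, the whole block needs only one AllReduce, after the row-parallel layer.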

Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).

Relationships

Implementation:Deepspeedai_DeepSpeed_TPTrainingConfig_Init

Metadata

Workflow: AutoTP_Training
Type: Principle
Last Updated: 2026-02-09 00:00 GMT
