Principle:Deepspeedai DeepSpeed AutoTP Configuration
Overview
AutoTP (automatic tensor parallelism) is configured through a combination of a Pydantic training config and dataclass-based layer pattern specifications, which lets custom model architectures define their own sharding rules.
Detailed Description
AutoTP configuration has three components:
1. TPTrainingConfig -- a Pydantic model (inheriting from DeepSpeedConfigModel) that specifies the top-level TP training settings. This is read from the tensor_parallel section of the DeepSpeed JSON config. Key fields include:
- autotp_size (int, default 0): The tensor parallelism degree. When 0, AutoTP is disabled.
- dtype (torch.dtype, default torch.float16): The desired model data type for TP operations.
- tp_overlap_comm (bool, default False): Whether to overlap AllReduce communication with computation.
- partition_config (Optional[Dict]): A dictionary specifying custom layer partitioning rules via TPLayerSpec.
- preset_model (Optional[str]): A string key to select a built-in preset (e.g., "llama", "bloom", "mixtral").
- tensor_parallel (nested TPConfig): Sub-config with tp_size, mpu, and tp_group.
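As a rough illustration, the tensor_parallel section could be assembled as a Python dict before being written out as JSON. The field names come from the list above; the values, the nesting, and the helper autotp_enabled are assumptions made for this sketch and have not been checked against the DeepSpeed schema.

```python
import json

# Hypothetical values; only the field names are taken from the TPTrainingConfig list.
ds_config = {
    "tensor_parallel": {
        "autotp_size": 4,          # TP degree; 0 would disable AutoTP
        "tp_overlap_comm": False,  # do not overlap AllReduce with computation
        "preset_model": "llama",   # select a built-in preset by key
    }
}


def autotp_enabled(config: dict) -> bool:
    """Illustrative helper: True when a non-zero TP degree is requested."""
    return config.get("tensor_parallel", {}).get("autotp_size", 0) > 0


print(json.dumps(ds_config, indent=2))
```

A config without a tensor_parallel section, or with autotp_size set to 0, would be treated as AutoTP disabled under this sketch.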
2. AutoTPConfig -- a dataclass that defines layer-level sharding patterns using TPLayerSpec rules. Each TPLayerSpec specifies:
- patterns: A list of regex patterns matching parameter names (e.g., .*\.o_proj\.weight$).
- partition_type: One of ROW, COLUMN, or SKIP.
- shape: Optional tuple for reshaping packed/fused weights before partitioning (e.g., (2, -1) for gate_up packed layers).
- partition_dim: Optional override for the partition dimension.
- model_types: Optional list restricting the spec to specific model architectures.
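The way such rules select a strategy for a parameter can be sketched with a small stand-in. LayerSpec and find_spec below are hypothetical and only mirror the fields listed above; they are not the DeepSpeed classes, the parameter names are illustrative, and first-match-wins ordering is an assumption.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class PartitionType(Enum):
    ROW = "row"
    COLUMN = "column"
    SKIP = "skip"


@dataclass
class LayerSpec:
    """Hypothetical stand-in mirroring the TPLayerSpec fields."""
    patterns: List[str]
    partition_type: PartitionType
    shape: Optional[Tuple[int, ...]] = None
    partition_dim: Optional[int] = None
    model_types: Optional[List[str]] = None


def find_spec(name: str, specs: List[LayerSpec],
              model_type: Optional[str] = None) -> Optional[LayerSpec]:
    """Return the first spec that applies to the parameter name (assumed ordering)."""
    for spec in specs:
        # A spec restricted to certain architectures is ignored for other models.
        if spec.model_types is not None and model_type not in spec.model_types:
            continue
        if any(re.match(pattern, name) for pattern in spec.patterns):
            return spec
    return None


specs = [
    LayerSpec([r".*\.o_proj\.weight$"], PartitionType.ROW),
    LayerSpec([r".*\.gate_up_proj\.weight$"], PartitionType.COLUMN, shape=(2, -1)),
    LayerSpec([r".*\.gate\.weight$"], PartitionType.SKIP, model_types=["mixtral"]),
]

match = find_spec("model.layers.0.self_attn.o_proj.weight", specs)
print(match.partition_type)
```

Parameters that match no rule return None here; how the real implementation treats unmatched parameters is not covered by this sketch.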
3. AutoTPPresets -- a collection of built-in presets for common architectures:
- llama: Separate Q, K, V projections with standard row/column mapping.
- bloom: Fused query_key_value with interleaved heads.
- chatglm: GLM-style fused QKV with shape=(3, -1).
- mixtral: Mixture-of-Experts with per-expert partitioning and MoE gate skipping.
- deepseek_v2: Multi-head Latent Attention (MLA) with skip rules for low-rank projections.
- qwen2: Standard LLaMA-like pattern.
- phi3: Phi-3 architecture pattern.
When both a preset and custom partition_config are specified, they can be merged if use_default_specs is true (custom specs take priority, defaults fill gaps). If use_default_specs is false, only the custom specs are used.
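One way to picture the merge behaviour is the sketch below. The function resolve_specs and the dict-based spec representation are hypothetical; the actual merge logic may differ in how it detects overlap between custom and default rules.

```python
from typing import Dict, List


def resolve_specs(custom: List[Dict], preset: List[Dict],
                  use_default_specs: bool) -> List[Dict]:
    """Custom specs come first; preset specs only fill patterns not already covered."""
    if not use_default_specs:
        return list(custom)
    covered = {pattern for spec in custom for pattern in spec["patterns"]}
    fillers = [spec for spec in preset
               if not any(pattern in covered for pattern in spec["patterns"])]
    return list(custom) + fillers


# Illustrative rules only; these are not the contents of any real preset.
preset_specs = [
    {"patterns": [r".*\.o_proj\.weight$"], "partition_type": "ROW"},
    {"patterns": [r".*\.q_proj\.weight$"], "partition_type": "COLUMN"},
]
custom_specs = [
    {"patterns": [r".*\.o_proj\.weight$"], "partition_type": "SKIP"},
]

merged = resolve_specs(custom_specs, preset_specs, use_default_specs=True)
custom_only = resolve_specs(custom_specs, preset_specs, use_default_specs=False)
print([spec["partition_type"] for spec in merged])
```

In this sketch the custom o_proj rule replaces the preset one, while the preset q_proj rule is retained because no custom rule covers it.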
Theoretical Basis
Tensor parallelism partitions individual weight matrices across GPUs. For a linear layer with weight W of shape [M, N]:
- Column parallelism splits along the output dimension N: each GPU holds W_i of shape [M, N/tp_size]. Each GPU produces a slice of the full output, which is already laid out as the input shard that a following row-parallel layer expects.
- Row parallelism splits along the input dimension M: each GPU holds W_i of shape [M/tp_size, N]. Each GPU computes a partial sum, then an AllReduce aggregates the results.
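Both schemes can be checked numerically on a toy example. The sketch below uses plain nested lists and the Y = XW convention implied by the [M, N] shape above; the helper names are made up for the illustration, and the element-wise sum across ranks stands in for a real AllReduce.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_cols(w, parts):
    """Split a [M, N] matrix into `parts` blocks of shape [M, N/parts]."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split a [M, N] matrix into `parts` blocks of shape [M/parts, N]."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


tp_size = 2
x = [[1, 2, 3, 4]]                      # input, shape [1, M] with M = 4
w = [[1, 0], [0, 1], [2, 1], [1, 3]]    # weight, shape [M, N] = [4, 2]
full = matmul(x, w)                     # single-device reference

# Column parallel: every rank sees all of x and produces a slice of the output.
col_out = [matmul(x, w_i) for w_i in split_cols(w, tp_size)]
col_gathered = [sum((part[r] for part in col_out), []) for r in range(len(x))]

# Row parallel: every rank sees a slice of x and produces a partial sum.
x_parts = split_cols(x, tp_size)
row_out = [matmul(x_i, w_i) for x_i, w_i in zip(x_parts, split_rows(w, tp_size))]
# Element-wise sum across ranks stands in for the AllReduce.
row_reduced = [[sum(vals) for vals in zip(*rows)] for rows in zip(*row_out)]

print(full, col_gathered, row_reduced)
```

Concatenating the column-parallel slices and summing the row-parallel partials both reproduce the single-device result.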
The configuration maps each linear layer in a transformer block to its parallelism strategy. In a standard transformer:
- Column-parallel: QKV projections, MLP gate/up projections (split output dimension).
- Row-parallel: Attention output projection, MLP down projection (split input dimension, AllReduce output).
- Skip: MoE gates, low-rank projections (not partitioned).
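The reason column-parallel and row-parallel layers are paired this way can be seen in a toy MLP: the column-parallel output on each rank is exactly the input shard its row-parallel layer needs, so only one reduction is required at the end. This is a minimal sketch with plain lists; ReLU is used as a stand-in activation and the weights are arbitrary.

```python
def matmul(a, b):
    """Plain-Python matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Element-wise activation; it acts per element, so it can run on a local shard."""
    return [[max(0, v) for v in row] for row in m]


tp_size = 2
x = [[1, -2]]                                 # input, shape [1, 2]
w_up = [[1, 2, -1, 0], [0, 1, 3, -2]]         # shape [2, 4], column-parallel
w_down = [[1, 0], [2, 1], [0, -1], [1, 1]]    # shape [4, 2], row-parallel

# Single-device reference.
reference = matmul(relu(matmul(x, w_up)), w_down)

half = len(w_up[0]) // tp_size
partials = []
for rank in range(tp_size):
    cols = slice(rank * half, (rank + 1) * half)
    w_up_i = [row[cols] for row in w_up]      # this rank's output columns of w_up
    w_down_i = w_down[cols]                   # the matching input rows of w_down
    hidden_i = relu(matmul(x, w_up_i))        # stays local, no communication
    partials.append(matmul(hidden_i, w_down_i))

# One element-wise sum across ranks stands in for the single AllReduce.
output = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
print(reference, output)
```

The same pairing applies to attention, where the QKV projections play the column-parallel role and the output projection the row-parallel one.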
Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).
Knowledge Sources
Relationships
Implementation:Deepspeedai_DeepSpeed_TPTrainingConfig_Init
Metadata
- Workflow: AutoTP_Training
- Type: Principle
- Last Updated: 2026-02-09 00:00 GMT