

Principle:Deepspeedai DeepSpeed AutoTP Configuration

From Leeroopedia


Overview

Automatic tensor parallelism (AutoTP) is configured through two pieces that work together: a Pydantic training config for the top-level settings, and dataclass-based layer pattern specifications that describe how to shard custom model architectures.

Detailed Description

AutoTP configuration has three components:

1. TPTrainingConfig -- a Pydantic model (inheriting from DeepSpeedConfigModel) that specifies the top-level TP training settings. This is read from the tensor_parallel section of the DeepSpeed JSON config. Key fields include:

  • autotp_size (int, default 0): The tensor parallelism degree. When 0, AutoTP is disabled.
  • dtype (torch.dtype, default torch.float16): The desired model data type for TP operations.
  • tp_overlap_comm (bool, default False): Whether to overlap AllReduce communication with computation.
  • partition_config (Optional[Dict]): A dictionary specifying custom layer partitioning rules via TPLayerSpec.
  • preset_model (Optional[str]): A string key to select a built-in preset (e.g., "llama", "bloom", "mixtral").
  • tensor_parallel (nested TPConfig): Sub-config with tp_size, mpu, and tp_group.

2. AutoTPConfig -- a dataclass that defines layer-level sharding patterns using TPLayerSpec rules. Each TPLayerSpec specifies:

  • patterns: A list of regex patterns matching parameter names (e.g., .*\.o_proj\.weight$).
  • partition_type: One of ROW, COLUMN, or SKIP.
  • shape: Optional tuple for reshaping packed/fused weights before partitioning (e.g., (2, -1) for gate_up packed layers).
  • partition_dim: Optional override for the partition dimension.
  • model_types: Optional list restricting the spec to specific model architectures.

3. AutoTPPresets -- a collection of built-in presets for common architectures:

  • llama: Separate Q, K, V projections with standard row/column mapping.
  • bloom: Fused query_key_value with interleaved heads.
  • chatglm: GLM-style fused QKV with shape=(3, -1).
  • mixtral: Mixture-of-Experts with per-expert partitioning and MoE gate skipping.
  • deepseek_v2: Multi-head Latent Attention (MLA) with skip rules for low-rank projections.
  • qwen2: Standard LLaMA-like pattern.
  • phi3: Phi-3 architecture pattern.
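
As a rough illustration of the top-level settings listed under TPTrainingConfig, the tensor_parallel section of a DeepSpeed config might be written as the Python dict below. This is a sketch, not a verified config: only the field names come from the list above, the values are made up, train_micro_batch_size_per_gpu is included merely as an example of an unrelated setting, and dtype is left out because its JSON spelling is not given here.

```python
import json

# Hypothetical values; only the field names under "tensor_parallel"
# are taken from the field list above.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # assumption: unrelated example setting
    "tensor_parallel": {
        "autotp_size": 4,          # TP degree; 0 would leave AutoTP disabled
        "tp_overlap_comm": False,  # do not overlap AllReduce with computation
        "preset_model": "llama",   # select a built-in preset by key
    },
}

tp_section = ds_config["tensor_parallel"]

# Per the description above, AutoTP is disabled when autotp_size is 0.
autotp_enabled = tp_section.get("autotp_size", 0) > 0

print(json.dumps(ds_config, indent=2))
print("AutoTP enabled:", autotp_enabled)
```

Keeping the preset key and the TP degree side by side in one section makes it easy to see at a glance whether AutoTP is active and which layer patterns it will use.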

When both a preset and a custom partition_config are specified, the use_default_specs flag controls how they combine. If it is true, the two are merged: custom specs take priority and the preset defaults fill any gaps. If it is false, only the custom specs are used.
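
The pattern matching and merge behaviour described above can be sketched in plain Python. Everything here is illustrative: LayerSpecSketch, PartitionType, merge_specs, and resolve are hypothetical stand-ins that mirror the fields listed for TPLayerSpec, not DeepSpeed's actual classes, and "first matching spec wins" is an assumption used to model custom specs taking priority.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class PartitionType(Enum):
    ROW = "row"
    COLUMN = "column"
    SKIP = "skip"


@dataclass
class LayerSpecSketch:
    """Illustrative stand-in for TPLayerSpec; not DeepSpeed's class."""
    patterns: List[str]
    partition_type: PartitionType
    shape: Optional[Tuple[int, ...]] = None
    partition_dim: Optional[int] = None
    model_types: Optional[List[str]] = None

    def matches(self, param_name: str, model_type: Optional[str] = None) -> bool:
        # A spec restricted to certain model types ignores all others.
        if self.model_types is not None and model_type not in self.model_types:
            return False
        return any(re.match(p, param_name) for p in self.patterns)


def merge_specs(custom, defaults, use_default_specs):
    # Custom specs come first so they win under first-match resolution;
    # defaults are appended only when use_default_specs is true.
    merged = list(custom)
    if use_default_specs:
        merged.extend(defaults)
    return merged


def resolve(specs, param_name, model_type=None):
    # Assumption: the first matching spec decides the partition type.
    for spec in specs:
        if spec.matches(param_name, model_type):
            return spec.partition_type
    return None


defaults = [
    LayerSpecSketch([r".*\.[qkv]_proj\.weight$"], PartitionType.COLUMN),
    LayerSpecSketch([r".*\.o_proj\.weight$"], PartitionType.ROW),
]
custom = [LayerSpecSketch([r".*\.o_proj\.weight$"], PartitionType.SKIP)]

merged = merge_specs(custom, defaults, use_default_specs=True)
custom_only = merge_specs(custom, defaults, use_default_specs=False)

o_proj = "model.layers.0.self_attn.o_proj.weight"
q_proj = "model.layers.0.self_attn.q_proj.weight"
print(resolve(merged, o_proj))       # custom rule overrides the default
print(resolve(merged, q_proj))       # default fills the gap
print(resolve(custom_only, q_proj))  # no default available
```

In this sketch the custom SKIP rule for o_proj shadows the default ROW rule, while q_proj falls through to the default only when the merge is enabled.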

Theoretical Basis

Tensor parallelism partitions individual weight matrices across GPUs. For a linear layer with weight W of shape [M, N]:

  • Column parallelism splits along the output dimension N: each GPU holds W_i of shape [M, N/tp_size]. Each GPU produces a slice of the output along N, which is exactly the sharded input that a following row-parallel layer expects, so no communication is needed between the two.
  • Row parallelism splits along the input dimension M: each GPU holds W_i of shape [M/tp_size, N]. Each GPU computes a partial sum, then an AllReduce aggregates the results.
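
A small numeric check makes the two schemes concrete. The sketch below uses plain Python lists in place of tensors and simulates tp_size = 2 ranks in a single process; the matrices are arbitrary example values, and the concatenation and summation steps stand in for the gather and AllReduce that real ranks would perform.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


tp_size = 2
X = [[1, 2, 3, 4]]            # one input row, M = 4
W = [[1, 0, 2, 1, 0, 1],
     [0, 1, 1, 0, 2, 1],
     [1, 1, 0, 2, 1, 0],
     [2, 0, 1, 1, 0, 1]]      # shape [M, N] = [4, 6]

full = matmul(X, W)           # unsharded reference result

# Column parallelism: rank r holds columns r*N/tp .. (r+1)*N/tp of W
# and produces the matching slice of the output.
n_shard = len(W[0]) // tp_size
col_outputs = [
    matmul(X, [row[r * n_shard:(r + 1) * n_shard] for row in W])
    for r in range(tp_size)
]
col_result = [sum((out[0] for out in col_outputs), [])]  # concatenate slices

# Row parallelism: rank r holds rows r*M/tp .. (r+1)*M/tp of W and the
# matching slice of the input; partial outputs are summed (the AllReduce).
m_shard = len(W) // tp_size
partials = [
    matmul([X[0][r * m_shard:(r + 1) * m_shard]], W[r * m_shard:(r + 1) * m_shard])
    for r in range(tp_size)
]
row_result = [[sum(vals) for vals in zip(*(p[0] for p in partials))]]

print(full == col_result, full == row_result)
```

Both sharded computations reproduce the unsharded product; the difference is that the column-parallel result is assembled by concatenation, while the row-parallel result requires a sum across ranks.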

The configuration maps each linear layer in a transformer block to its parallelism strategy. In a standard transformer:

  • Column-parallel: QKV projections, MLP gate/up projections (split output dimension).
  • Row-parallel: Attention output projection, MLP down projection (split input dimension, AllReduce output).
  • Skip: MoE gates, low-rank projections (not partitioned).

Reference: Megatron-LM (https://arxiv.org/abs/1909.08053).

Relationships

Implementation:Deepspeedai_DeepSpeed_TPTrainingConfig_Init

Metadata

  • Workflow: AutoTP_Training
  • Type: Principle
  • Last Updated: 2026-02-09 00:00 GMT
