Principle:Deepspeedai DeepSpeed Mesh Device Configuration

Overview

Mesh device configuration establishes a multi-dimensional communication mesh that defines the data-parallel and sequence-parallel process groups used for coordinated distributed training.

Detailed Description

Mesh device configuration creates a 2D process group topology where one dimension handles data parallelism (replicating the model across groups) and the other handles sequence parallelism (splitting sequences across GPUs within a group). For example, with 8 GPUs and mesh_param=(2, 4), there are 2 data-parallel groups of 4 GPUs each, where each group of 4 performs sequence parallelism. This topology is the foundation for Ulysses sequence parallelism.
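The 8-GPU example can be made concrete with a small, framework-free sketch that enumerates the ranks in each group. It assumes a row-major layout (rank = dp_index * sp_size + sp_index), which matches the groups [0,1,2,3] and [4,5,6,7] listed in the table later on this page. The function name build_mesh_groups and its return names are hypothetical and are not part of the DeepSpeed API.

```python
def build_mesh_groups(dp_size, sp_size):
    """Enumerate global ranks for a (dp_size, sp_size) mesh.

    Assumption: ranks are laid out row-major, i.e.
    rank = dp_index * sp_size + sp_index.
    """
    # One row per model replica. The ranks in a row cooperate on
    # sequence parallelism (all-to-all happens inside a row).
    replica_groups = [
        [dp * sp_size + sp for sp in range(sp_size)]
        for dp in range(dp_size)
    ]
    # One column per sequence-shard position. The ranks in a column hold
    # the same shard position in different replicas.
    cross_replica_groups = [
        [dp * sp_size + sp for dp in range(dp_size)]
        for sp in range(sp_size)
    ]
    return replica_groups, cross_replica_groups


replica_groups, cross_replica_groups = build_mesh_groups(2, 4)
print(replica_groups)        # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(cross_replica_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every rank appears in exactly one row and one column, which is what lets a named mesh derive both sets of process groups from a single (dp_size, sp_size) tuple.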

The mesh device abstraction allows the DeepSpeed runtime to correctly determine the effective world size for each dimension. The data-parallel dimension is used by ZeRO to partition optimizer states, gradients, and parameters, while the sequence-parallel dimension is used for all-to-all communication during attention computation. By encoding these relationships into a named mesh, the system can derive process groups automatically rather than requiring manual group construction.

The configuration accepts a tuple of two integers representing (dp_size, sp_size) where:

dp_size: The number of data-parallel groups. Each group receives a replica of the full model.
sp_size: The number of GPUs within each data-parallel group that cooperate on sequence parallelism.

The product dp_size * sp_size must equal the total number of GPUs (world size).
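A minimal sketch of this constraint as a pre-flight check is shown below. The helper validate_mesh_param is hypothetical (it is not a DeepSpeed function); it only illustrates the rule that the two dimensions must multiply to the world size, and could be run before the tuple is handed to the runtime.

```python
def validate_mesh_param(mesh_param, world_size):
    """Check a (dp_size, sp_size) tuple against the total GPU count.

    Hypothetical helper for illustration; returns the unpacked sizes
    or raises ValueError when the mesh cannot cover the world size.
    """
    if len(mesh_param) != 2:
        raise ValueError(f"expected (dp_size, sp_size), got {mesh_param!r}")
    dp_size, sp_size = mesh_param
    if dp_size < 1 or sp_size < 1:
        raise ValueError("dp_size and sp_size must both be positive")
    if dp_size * sp_size != world_size:
        raise ValueError(
            f"dp_size * sp_size = {dp_size * sp_size} "
            f"does not match world size {world_size}"
        )
    return dp_size, sp_size


print(validate_mesh_param((2, 4), 8))  # (2, 4)
```

A tuple such as (2, 3) on 8 GPUs would be rejected, because 6 ranks cannot tile an 8-rank job.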

Theoretical Basis

Multi-dimensional parallelism partitions the total GPU count W into a dp_size x sp_size grid where W = dp_size * sp_size.

Data parallelism replicates the model across the dp_size rows of this grid. Gradient synchronization (all-reduce) occurs along the data-parallel dimension.
Sequence parallelism splits the input sequence across the sp_size columns of this grid, so each GPU in a row holds one slice of the sequence. All-to-all communication occurs within each sequence-parallel group (one row of sp_size GPUs).
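To illustrate what splitting the sequence across one sequence-parallel group means, the sketch below hands each rank a contiguous slice of the token list. It assumes contiguous, equal-length shards and a sequence length divisible by sp_size; the function shard_sequence is hypothetical and stands in for whatever slicing the data loader performs in a real setup.

```python
def shard_sequence(tokens, sp_size, sp_rank):
    """Return the contiguous slice of `tokens` owned by `sp_rank`.

    Assumption: len(tokens) is divisible by sp_size and shards are
    contiguous, equal-length chunks.
    """
    if not 0 <= sp_rank < sp_size:
        raise ValueError("sp_rank must be in [0, sp_size)")
    if len(tokens) % sp_size != 0:
        raise ValueError("sequence length must be divisible by sp_size")
    shard_len = len(tokens) // sp_size
    start = sp_rank * shard_len
    return tokens[start:start + shard_len]


tokens = list(range(16))              # a toy sequence of 16 token ids
print(shard_sequence(tokens, 4, 0))   # [0, 1, 2, 3]
print(shard_sequence(tokens, 4, 1))   # [4, 5, 6, 7]
```

With sp_size = 4, each GPU in a group holds a quarter of the sequence, and the all-to-all during attention is what exchanges data between those slices.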

This separation is critical because ZeRO optimization operates only across the data-parallel dimension. The effective world size seen by ZeRO is W / sp_size, not the full W. If the full world size were used, batch size calculations and optimizer state partitioning would be incorrect.
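The effect on batch size can be worked through numerically. The sketch below assumes the usual DeepSpeed relationship train_batch_size = micro_batch_per_gpu * gradient_accumulation_steps * world_size, with the ZeRO-visible world size substituted for the total GPU count. The function name effective_batch_config is hypothetical.

```python
def effective_batch_config(world_size, sp_size, micro_batch_per_gpu, grad_accum_steps):
    """Compute the ZeRO-visible world size and the global batch size.

    Assumption: global batch = micro batch * accumulation steps *
    data-parallel world size, where that world size is W / sp_size.
    """
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    zero_world_size = world_size // sp_size  # equals dp_size
    train_batch_size = micro_batch_per_gpu * grad_accum_steps * zero_world_size
    return zero_world_size, train_batch_size


zero_ws, batch = effective_batch_config(
    world_size=8, sp_size=4, micro_batch_per_gpu=2, grad_accum_steps=1
)
print(zero_ws, batch)  # 2 4
```

Using the full world size of 8 in the same formula would give a batch size of 16, a fourfold overcount, because the 4 GPUs in each sequence-parallel group work on the same samples rather than on different ones.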

Parameter	Description	Example (8 GPUs)
dp_size	Number of data-parallel groups	2
sp_size	GPUs per SP group	4
Total GPUs	dp_size * sp_size	8
ZeRO world_size	dp_size	2
SP communication scope	Within each group of sp_size GPUs	GPUs [0,1,2,3] and [4,5,6,7]

Reference

DeepSpeed-Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (https://arxiv.org/abs/2309.14509)

Related Pages

Implementation:Deepspeedai_DeepSpeed_Initialize_Mesh_Device

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
