Principle:Deepspeedai DeepSpeed Mesh Device Configuration
Overview
Mesh device configuration establishes a two-dimensional communication mesh that defines the data-parallel and sequence-parallel process groups used for coordinated distributed training.
Detailed Description
Mesh device configuration creates a 2D process group topology where one dimension handles data parallelism (replicating the model across groups) and the other handles sequence parallelism (splitting sequences across GPUs within a group). For example, with 8 GPUs and mesh_param=(2, 4), the model is replicated 2 times, and each replica spans a group of 4 GPUs that split every input sequence among themselves. This topology is the foundation for Ulysses sequence parallelism.
The mesh device abstraction allows the DeepSpeed runtime to correctly determine the effective world size for each dimension. The data-parallel dimension is used by ZeRO to partition optimizer states, gradients, and parameters, while the sequence-parallel dimension is used for all-to-all communication during attention computation. By encoding these relationships into a named mesh, the system can derive process groups automatically rather than requiring manual group construction.
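To make the derivation of groups from a mesh concrete, here is a minimal pure-Python sketch. It assumes ranks are laid out row-major over the (dp_size, sp_size) grid, which matches the 8-GPU example on this page. The helper name `build_mesh_groups` is hypothetical and is not part of the DeepSpeed API.

```python
from typing import List, Tuple


def build_mesh_groups(mesh_param: Tuple[int, int]) -> Tuple[List[List[int]], List[List[int]]]:
    """Return (sp_groups, dp_groups) for a row-major (dp_size, sp_size) mesh.

    Hypothetical helper for illustration only; assumes row-major rank layout.
    """
    dp_size, sp_size = mesh_param
    # Each row is one model replica: its ranks form a sequence-parallel group.
    sp_groups = [[r * sp_size + c for c in range(sp_size)] for r in range(dp_size)]
    # Each column collects the ranks at the same sequence position in every
    # replica: these ranks form a group along the data-parallel dimension.
    dp_groups = [[r * sp_size + c for r in range(dp_size)] for c in range(sp_size)]
    return sp_groups, dp_groups


sp_groups, dp_groups = build_mesh_groups((2, 4))
print(sp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each group along the data-parallel dimension has dp_size members, which is the world size ZeRO operates on.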
The configuration accepts a tuple of two integers representing (dp_size, sp_size) where:
- dp_size: The number of data-parallel groups. Each group receives a replica of the full model.
- sp_size: The number of GPUs within each data-parallel group that cooperate on sequence parallelism.
The product dp_size * sp_size must equal the total number of GPUs (world size).
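The constraints above can be checked before the tuple is handed to the runtime. The following is a sketch only: `validate_mesh_param` is a hypothetical helper, not a DeepSpeed function, and the world size is assumed to be supplied by the caller.

```python
from typing import Sequence, Tuple


def validate_mesh_param(mesh_param: Sequence[int], world_size: int) -> Tuple[int, int]:
    """Check a (dp_size, sp_size) pair against the world size.

    Hypothetical helper for illustration; raises ValueError on any violation.
    """
    if len(mesh_param) != 2:
        raise ValueError(f"mesh_param must have exactly two entries, got {len(mesh_param)}")
    dp_size, sp_size = mesh_param
    for name, value in (("dp_size", dp_size), ("sp_size", sp_size)):
        if not isinstance(value, int) or value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
    # The mesh must cover every GPU exactly once.
    if dp_size * sp_size != world_size:
        raise ValueError(
            f"dp_size * sp_size = {dp_size * sp_size} does not match world size {world_size}"
        )
    return dp_size, sp_size


print(validate_mesh_param((2, 4), world_size=8))  # (2, 4)
```

Failing early with a clear message is usually easier to debug than a hang or shape mismatch during the first collective.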
Theoretical Basis
Multi-dimensional parallelism partitions the total GPU count W into a dp_size x sp_size grid where W = dp_size * sp_size.
- Data parallelism replicates the model across the rows of this grid, one replica per row. Gradient synchronization (all-reduce) occurs along the data-parallel dimension.
- Sequence parallelism splits the input sequence across the columns of this grid. All-to-all communication occurs within sequence-parallel groups.
This separation is critical because ZeRO optimization operates only across the data-parallel dimension. The effective world size seen by ZeRO is W / sp_size = dp_size, not the full W. If the full world size were used, batch size calculations and optimizer state partitioning would assume sp_size times as many data-parallel ranks as actually exist, and both would be incorrect.
| Parameter | Description | Example (8 GPUs) |
|---|---|---|
| dp_size | Number of data-parallel groups | 2 |
| sp_size | GPUs per SP group | 4 |
| Total GPUs | dp_size * sp_size | 8 |
| ZeRO world_size | dp_size | 2 |
| SP communication scope | Within each group of sp_size GPUs | GPUs [0,1,2,3] and [4,5,6,7] |
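The effect of the ZeRO world size on batch size can be worked through with the values from the table. This sketch assumes the usual relationship in which the global train batch size is the per-GPU micro-batch size times the gradient accumulation steps times the data-parallel world size. The function names are hypothetical.

```python
def zero_world_size(world_size: int, sp_size: int) -> int:
    """Effective world size seen by ZeRO: W / sp_size (hypothetical helper)."""
    if sp_size < 1 or world_size % sp_size != 0:
        raise ValueError(f"world size {world_size} is not divisible by sp_size {sp_size}")
    return world_size // sp_size


def global_batch_size(micro_batch_per_gpu: int, grad_accum_steps: int, dp_world_size: int) -> int:
    """Samples consumed per optimizer step (assumed relationship, see lead-in)."""
    return micro_batch_per_gpu * grad_accum_steps * dp_world_size


W, sp_size = 8, 4
dp_world = zero_world_size(W, sp_size)
print(dp_world)                           # 2
print(global_batch_size(1, 2, dp_world))  # 4  (correct: uses W / sp_size)
print(global_batch_size(1, 2, W))         # 16 (wrong: uses the full W)
```

Using the full world size overstates the batch by a factor of sp_size, because the GPUs inside one sequence-parallel group share the same samples rather than processing different ones.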
Reference
- DeepSpeed-Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (https://arxiv.org/abs/2309.14509)
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2309.14509
Last updated: 2026-02-09 00:00 GMT