Principle:Deepspeedai DeepSpeed Mesh Device Configuration

From Leeroopedia


Overview

Mesh device configuration establishes a two-dimensional communication mesh that defines the data-parallel and sequence-parallel process groups used for coordinated distributed training.

Detailed Description

Mesh device configuration creates a 2D process group topology where one dimension handles data parallelism (replicating the model across groups) and the other handles sequence parallelism (splitting sequences across GPUs within a group). For example, with 8 GPUs and mesh_param=(2, 4), there are 2 data-parallel groups of 4 GPUs each, where each group of 4 performs sequence parallelism. This topology is the foundation for Ulysses sequence parallelism.
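The 8-GPU example above can be sketched in plain Python. This is only an illustration: the helper name `mesh_coords` is hypothetical and not part of DeepSpeed, and it assumes a row-major rank layout (consecutive ranks share a sequence-parallel group), which matches the grouping [0,1,2,3] and [4,5,6,7] given in the table on this page.

```python
def mesh_coords(rank, mesh_param):
    """Map a global rank to (dp_index, sp_index) on a (dp_size, sp_size) mesh.

    Assumes a row-major layout: rank = dp_index * sp_size + sp_index.
    Hypothetical helper for illustration, not a DeepSpeed function.
    """
    dp_size, sp_size = mesh_param
    if not 0 <= rank < dp_size * sp_size:
        raise ValueError(f"rank {rank} is outside a {dp_size}x{sp_size} mesh")
    return divmod(rank, sp_size)


mesh_param = (2, 4)  # 2 data-parallel groups, 4 GPUs in each
for rank in range(mesh_param[0] * mesh_param[1]):
    dp_index, sp_index = mesh_coords(rank, mesh_param)
    print(f"rank {rank}: data-parallel group {dp_index}, sequence shard {sp_index}")
```

Under this layout, rank 5 lands in data-parallel group 1 and holds sequence shard 1.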

The mesh device abstraction allows the DeepSpeed runtime to correctly determine the effective world size for each dimension. The data-parallel dimension is used by ZeRO to partition optimizer states, gradients, and parameters, while the sequence-parallel dimension is used for all-to-all communication during attention computation. By encoding these relationships into a named mesh, the system can derive process groups automatically rather than requiring manual group construction.

The configuration accepts a tuple of two integers representing (dp_size, sp_size) where:

  • dp_size: The number of data-parallel groups. Each group receives a replica of the full model.
  • sp_size: The number of GPUs within each data-parallel group that cooperate on sequence parallelism.

The product dp_size * sp_size must equal the total number of GPUs (world size).
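A minimal sketch of this constraint as a pre-flight check. The function `validate_mesh_param` is a hypothetical helper, not a DeepSpeed API; it only encodes the rule stated above.

```python
def validate_mesh_param(mesh_param, world_size):
    """Check that (dp_size, sp_size) is a valid factorization of world_size.

    Hypothetical helper for illustration; returns the tuple unchanged if valid.
    """
    if len(mesh_param) != 2:
        raise ValueError("mesh_param must be a pair (dp_size, sp_size)")
    dp_size, sp_size = mesh_param
    if dp_size < 1 or sp_size < 1:
        raise ValueError("dp_size and sp_size must both be positive")
    if dp_size * sp_size != world_size:
        raise ValueError(
            f"dp_size * sp_size = {dp_size * sp_size}, "
            f"but world size is {world_size}"
        )
    return dp_size, sp_size


validate_mesh_param((2, 4), world_size=8)    # valid: 2 * 4 == 8
# validate_mesh_param((2, 3), world_size=8)  # would raise ValueError
```

Running a check like this before launching makes a mismatched mesh fail early with a readable message instead of surfacing later during process group setup.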

Theoretical Basis

Multi-dimensional parallelism partitions the total GPU count W into a dp_size x sp_size grid where W = dp_size * sp_size.

  • Data parallelism replicates the model across the rows of this grid. Gradient synchronization (allreduce) occurs along the data-parallel dimension, that is, between the replicas.
  • Sequence parallelism splits the input sequence across the columns of this grid. All-to-all communication occurs within sequence-parallel groups.
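The two communication patterns above can be made concrete by listing which ranks talk to each other along each dimension. The sketch below is plain Python, not DeepSpeed code: the name `communication_groups` is hypothetical, and the row-major rank layout is an assumption.

```python
def communication_groups(dp_size, sp_size):
    """List the rank sets that communicate along each mesh dimension.

    Assumes rank = dp_index * sp_size + sp_index (row-major layout).
    Hypothetical helper for illustration only.
    """
    # Rows of the grid: ranks that share one model replica and
    # exchange sequence shards via all-to-all.
    sp_comm_groups = [
        [dp * sp_size + sp for sp in range(sp_size)] for dp in range(dp_size)
    ]
    # Columns of the grid: ranks holding the same sequence shard
    # in different replicas, i.e. the data-parallel dimension.
    dp_comm_groups = [
        [dp * sp_size + sp for dp in range(dp_size)] for sp in range(sp_size)
    ]
    return dp_comm_groups, sp_comm_groups


dp_comm_groups, sp_comm_groups = communication_groups(dp_size=2, sp_size=4)
print(sp_comm_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_comm_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Every rank appears in exactly one group per dimension, which is what lets a runtime derive both sets of process groups from the mesh shape alone.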

This separation is critical because ZeRO optimization operates only across the data-parallel dimension. The effective world size seen by ZeRO is W / sp_size = dp_size, not the full W. If the full world size were used, batch size calculations and optimizer state partitioning would be incorrect: with 8 GPUs and sp_size = 4, for example, ZeRO must see a world size of 2 rather than 8.
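A short sketch of why the distinction matters for batch size arithmetic. It assumes the usual DeepSpeed relation (global batch = micro-batch per GPU × gradient accumulation steps × data-parallel world size). The function names and the sample values (micro-batch 2, accumulation 4) are hypothetical and chosen only for illustration.

```python
def zero_world_size(world_size, sp_size):
    """Effective world size for ZeRO: W / sp_size (hypothetical helper)."""
    if world_size % sp_size != 0:
        raise ValueError("world_size must be divisible by sp_size")
    return world_size // sp_size


def global_batch_size(micro_batch_per_gpu, grad_accum_steps, dp_world_size):
    """Global batch under the assumed relation: micro * accum * dp world size."""
    return micro_batch_per_gpu * grad_accum_steps * dp_world_size


world_size, sp_size = 8, 4
dp_world = zero_world_size(world_size, sp_size)  # 2

correct = global_batch_size(2, 4, dp_world)      # uses W / sp_size
naive = global_batch_size(2, 4, world_size)      # mistakenly uses the full W
print(correct, naive)  # 16 64
```

In this example the naive calculation overstates the global batch by a factor of sp_size, because the 4 GPUs in each sequence-parallel group share the same samples rather than each consuming its own.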

Parameter | Description | Example (8 GPUs)
dp_size | Number of data-parallel groups | 2
sp_size | GPUs per SP group | 4
Total GPUs | dp_size * sp_size | 8
ZeRO world_size | dp_size | 2
SP communication scope | Within each group of sp_size GPUs | GPUs [0,1,2,3] and [4,5,6,7]

Last updated: 2026-02-09 00:00 GMT
