Principle:Deepspeedai DeepSpeed SP Engine Init

Overview

This principle covers initializing the DeepSpeed engine with sequence parallelism (SP) support by combining a mesh device configuration, an adjusted world size, and SP-adapted model components.

Detailed Description

When deepspeed.initialize() is called with mesh_param (or with a config containing sequence_parallel_size and data_parallel_size), it creates a mesh device and adjusts the effective world size used for ZeRO optimization. That world size is the size of the data-parallel dimension (total_gpus / sp_size), not the total number of ranks. The mpu object from register_with_transformers() provides the SP process group used for communication.

The initialization process involves several coordinated steps:

1. Mesh device creation: The mesh is created with shape (dp_size, sp_size) and named dimensions ("data_parallel", "sequence_parallel").
2. World size adjustment: DeepSpeedConfig receives the mesh_device and uses the data-parallel group's world size rather than the total world size. This ensures correct batch size calculations and ZeRO partitioning.
3. Engine construction: The DeepSpeedEngine is constructed with the adjusted configuration, the mpu for SP group access, and the mesh_device for correct process group routing.
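
The first two steps can be sketched in plain Python, without importing DeepSpeed. The helper name plan_sp_mesh is hypothetical and not part of DeepSpeed; only the (dp_size, sp_size) shape, the dimension names, and the two config keys come from the description above. The divisibility check is an assumption about what a valid layout requires.

```python
def plan_sp_mesh(total_gpus: int, sp_size: int) -> dict:
    """Derive the mesh shape and SP-related config values for a job.

    Hypothetical helper for illustration only; it is not a DeepSpeed API.
    """
    if total_gpus < 1 or sp_size < 1:
        raise ValueError("total_gpus and sp_size must be positive")
    if total_gpus % sp_size != 0:
        raise ValueError(
            f"total_gpus={total_gpus} is not divisible by sp_size={sp_size}"
        )
    dp_size = total_gpus // sp_size
    return {
        # Shape and dimension names as described in the mesh-creation step.
        "mesh_shape": (dp_size, sp_size),
        "mesh_dim_names": ("data_parallel", "sequence_parallel"),
        # Config keys named in the text; the rest of the config is omitted.
        "config": {
            "sequence_parallel_size": sp_size,
            "data_parallel_size": dp_size,
        },
    }


plan = plan_sp_mesh(total_gpus=8, sp_size=4)
print(plan["mesh_shape"])  # (2, 4)
```

In a real job, the mesh shape would be supplied through mesh_param (or the two keys through the config) when calling deepspeed.initialize(); that call is left out here so the sketch runs without GPUs.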

The world size adjustment is particularly important because ZeRO partitions optimizer states (stage 1), gradients (stage 2), and parameters (stage 3) across what it considers the "world." With SP, the relevant world for ZeRO is only the data-parallel dimension, since all GPUs in an SP group process different parts of the same sequence and must have identical model weights.
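
The effect of the adjustment on batch size arithmetic can be illustrated with a small numeric sketch. It assumes the usual DeepSpeed relation in which the global batch size equals micro-batch size per GPU times gradient accumulation steps times world size; the function name and the example numbers are hypothetical.

```python
def effective_batch_size(micro_batch: int, grad_accum: int, world_size: int) -> int:
    """Global batch size implied by a given notion of 'world size'."""
    return micro_batch * grad_accum * world_size


total_gpus, sp_size = 8, 4
dp_world_size = total_gpus // sp_size  # 2 independent data-parallel replicas

# Ranks in one SP group share the same samples, so counting every GPU
# overstates the number of independent samples per step by a factor of sp_size.
unadjusted = effective_batch_size(micro_batch=1, grad_accum=4, world_size=total_gpus)
adjusted = effective_batch_size(micro_batch=1, grad_accum=4, world_size=dp_world_size)

print(unadjusted, adjusted)  # 32 8
```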

Theoretical Basis

With sequence parallelism, the effective data-parallel world size is:

dp_world_size = W / sp_size

where W is the total number of GPUs and is assumed to be divisible by sp_size.

ZeRO partitions optimizer states, gradients, and parameters across the DP dimension only (across dp_world_size ranks).
SP communication (all-to-all for attention) occurs orthogonally within SP groups of size sp_size.
The mesh_device ensures correct group assignment so that each rank knows its DP peers and SP peers.
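
To make the group assignment concrete, the following sketch computes the DP and SP peers of a rank in a (dp_size, sp_size) mesh. It assumes ranks are laid out row-major over the mesh (rank = dp_index * sp_size + sp_index); that layout and the helper name mesh_peers are assumptions for illustration, not something the text above specifies.

```python
def mesh_peers(rank: int, dp_size: int, sp_size: int) -> dict:
    """Return the DP and SP peer ranks of `rank` in a (dp_size, sp_size) mesh.

    Assumes a row-major layout: rank = dp_index * sp_size + sp_index.
    """
    world = dp_size * sp_size
    if not 0 <= rank < world:
        raise ValueError(f"rank {rank} is outside the world of size {world}")
    dp_index, sp_index = divmod(rank, sp_size)
    return {
        "dp_index": dp_index,
        "sp_index": sp_index,
        # Ranks in the same mesh row split one sequence among themselves.
        "sp_peers": [dp_index * sp_size + j for j in range(sp_size)],
        # Ranks in the same mesh column form the data-parallel group.
        "dp_peers": [i * sp_size + sp_index for i in range(dp_size)],
    }


info = mesh_peers(rank=5, dp_size=2, sp_size=4)
print(info["sp_peers"])  # [4, 5, 6, 7]
print(info["dp_peers"])  # [1, 5]
```

With W = 8 and sp_size = 4, each DP group has dp_world_size = 2 members and each SP group has 4, matching the formula above.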

The DeepSpeedConfig determines the world size through a priority chain:

1. If mpu is provided and has get_data_parallel_world_size(), use that.
2. If mpu is provided without that method (Ulysses SP case), compute dist.get_world_size() / mpu.get_sequence_parallel_world_size().
3. If mesh_device is provided, use the data-parallel group's world size.
4. If sequence_parallel_size is in the config, compute dist.get_world_size() / sequence_parallel_size.
5. Otherwise, use the full world size.
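
The priority chain can be mirrored as a standalone function. This is a sketch of the decision order only, not the DeepSpeedConfig source: resolve_world_size, the two stand-in mpu classes, and the mesh_dp_world_size argument (a plain integer standing in for the size of the mesh's data-parallel group) are hypothetical, and integer division is used so the result stays an int.

```python
from typing import Any, Mapping, Optional


def resolve_world_size(
    total_world_size: int,
    mpu: Optional[Any] = None,
    mesh_dp_world_size: Optional[int] = None,
    config: Optional[Mapping[str, Any]] = None,
) -> int:
    """Pick the world size following the priority chain described above."""
    config = config or {}
    if mpu is not None:
        if hasattr(mpu, "get_data_parallel_world_size"):
            return mpu.get_data_parallel_world_size()
        # Ulysses SP case: the mpu only knows the SP group size.
        return total_world_size // mpu.get_sequence_parallel_world_size()
    if mesh_dp_world_size is not None:
        return mesh_dp_world_size
    if "sequence_parallel_size" in config:
        return total_world_size // config["sequence_parallel_size"]
    return total_world_size


class DataParallelAwareMpu:
    """Stand-in mpu that reports a data-parallel world size directly."""

    def get_data_parallel_world_size(self) -> int:
        return 2


class UlyssesStyleMpu:
    """Stand-in mpu that only reports the sequence-parallel world size."""

    def get_sequence_parallel_world_size(self) -> int:
        return 4


print(resolve_world_size(8, mpu=UlyssesStyleMpu()))                  # 2
print(resolve_world_size(8, config={"sequence_parallel_size": 2}))   # 4
print(resolve_world_size(8))                                         # 8
```

Note that an mpu, when present, takes precedence over both the mesh and the config, so a mismatched mpu would silently override the other two sources in this ordering.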

Last updated: 2026-02-09 00:00 GMT
