Principle:Deepspeedai DeepSpeed SP Engine Init
Overview
Initializing the DeepSpeed engine with sequence parallelism (SP) support combines three things: a mesh device configuration, an adjusted world size, and SP-adapted model components.
Detailed Description
When deepspeed.initialize() is called with mesh_param (or a config containing sequence_parallel_size and data_parallel_size), it creates a mesh device and adjusts the effective world size for ZeRO optimization. The world size used for ZeRO is the data-parallel dimension (total_gpus / sp_size), not the full world. The mpu object from register_with_transformers() provides the SP process group for communication.
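A minimal sketch of the config-driven route, assuming a config dict that carries both sequence_parallel_size and data_parallel_size. The helper names build_sp_config and init_sp_engine are hypothetical, the optimizer and ZeRO settings are placeholder values, and mpu is assumed to be the object returned by register_with_transformers(). Passing mesh_param instead of the two config keys is the other route described above.

```python
def build_sp_config(world_size: int, sp_size: int, micro_batch: int = 1) -> dict:
    """Build a DeepSpeed config dict that carries the SP and DP sizes.

    All values other than the two parallel sizes are illustrative placeholders.
    """
    if sp_size <= 0 or world_size % sp_size != 0:
        raise ValueError("world_size must be a positive multiple of sp_size")
    return {
        "train_micro_batch_size_per_gpu": micro_batch,
        "zero_optimization": {"stage": 3},
        "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
        "sequence_parallel_size": sp_size,
        "data_parallel_size": world_size // sp_size,
    }


def init_sp_engine(model, mpu, world_size: int, sp_size: int):
    """Hypothetical wrapper; only meaningful inside a launched distributed job."""
    import deepspeed  # imported lazily so build_sp_config works without DeepSpeed

    config = build_sp_config(world_size, sp_size)
    # deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler).
    engine, optimizer, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config=config,
        mpu=mpu,  # gives the engine access to the SP process group
    )
    return engine, optimizer
```

Only build_sp_config can be exercised on a single machine; init_sp_engine needs an initialized distributed environment with world_size ranks.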
The initialization process involves several coordinated steps:
- Mesh device creation: The mesh is created with shape (dp_size, sp_size) and named dimensions ("data_parallel", "sequence_parallel").
- World size adjustment: DeepSpeedConfig receives the mesh_device and uses the data-parallel group's world size rather than the total world size. This ensures correct batch size calculations and ZeRO partitioning.
- Engine construction: The DeepSpeedEngine is constructed with the adjusted configuration, the mpu for SP group access, and the mesh_device for correct process group routing.
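To make the mesh shape concrete, the sketch below enumerates the DP and SP groups of a (dp_size, sp_size) mesh in plain Python. It assumes a row-major rank layout (rank = dp_index * sp_size + sp_index), which is how a 2-D device mesh is commonly laid out; the actual assignment comes from the mesh device, and the function names mesh_groups and peers are hypothetical.

```python
def mesh_groups(dp_size: int, sp_size: int):
    """Return (dp_groups, sp_groups) for a row-major (dp_size, sp_size) mesh."""
    # mesh[d][s] is the global rank at data-parallel index d, sequence-parallel index s.
    mesh = [[d * sp_size + s for s in range(sp_size)] for d in range(dp_size)]
    # Ranks in one row share a dp index and together form one SP group.
    sp_groups = [row[:] for row in mesh]
    # Ranks in one column share an sp index and together form one DP group.
    dp_groups = [[mesh[d][s] for d in range(dp_size)] for s in range(sp_size)]
    return dp_groups, sp_groups


def peers(rank: int, dp_size: int, sp_size: int) -> dict:
    """Look up the DP peers and SP peers of a single rank."""
    dp_groups, sp_groups = mesh_groups(dp_size, sp_size)
    return {
        "data_parallel": dp_groups[rank % sp_size],
        "sequence_parallel": sp_groups[rank // sp_size],
    }
```

With 8 GPUs and sp_size = 4, this layout puts rank 5 in the SP group [4, 5, 6, 7] and the DP group [1, 5]. ZeRO would partition state across the two ranks of the DP group only.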
The world size adjustment is particularly important because ZeRO stages (1, 2, 3) partition optimizer states, gradients, and parameters across what they consider the "world." With SP, the relevant world for ZeRO is only the data-parallel dimension, since all GPUs in an SP group process different parts of the same sequence and must have identical model weights.
Theoretical Basis
With sequence parallelism, the effective data-parallel world size is:
dp_world_size = W / sp_size
where W is the total number of GPUs.
- ZeRO partitions optimizer states, gradients, and parameters across the DP dimension only (across dp_world_size ranks).
- SP communication (all-to-all for attention) occurs orthogonally within SP groups of size sp_size.
- The mesh_device ensures correct group assignment so that each rank knows its DP peers and SP peers.
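The arithmetic can be checked with a short sketch. The function names are hypothetical, the shard size ignores any padding or alignment a real ZeRO implementation applies, and the batch size relation assumes the usual DeepSpeed rule that the global batch equals micro-batch size times gradient accumulation steps times the world size used for ZeRO.

```python
def dp_world_size(total_gpus: int, sp_size: int) -> int:
    """Effective data-parallel world size: W / sp_size."""
    if sp_size <= 0 or total_gpus % sp_size != 0:
        raise ValueError("total_gpus must be a positive multiple of sp_size")
    return total_gpus // sp_size


def zero_shard_numel(numel: int, total_gpus: int, sp_size: int) -> int:
    """Approximate elements per rank when a tensor is split across DP ranks only."""
    dp = dp_world_size(total_gpus, sp_size)
    return -(-numel // dp)  # ceiling division


def train_batch_size(micro_batch: int, grad_accum: int, total_gpus: int, sp_size: int) -> int:
    """Global batch size computed with the DP world size, not the total GPU count."""
    return micro_batch * grad_accum * dp_world_size(total_gpus, sp_size)
```

For example, with W = 8 and sp_size = 4, dp_world_size is 2, so a micro-batch of 2 with 4 accumulation steps gives a global batch of 16 sequences rather than the 64 that using the full world size would imply.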
The DeepSpeedConfig determines the world size through a priority chain:
- If mpu is provided and has get_data_parallel_world_size(), use that.
- If mpu is provided without that method (Ulysses SP case), compute dist.get_world_size() / mpu.get_sequence_parallel_world_size().
- If mesh_device is provided, use the data-parallel group's world size.
- If sequence_parallel_size is in the config, compute dist.get_world_size() / sequence_parallel_size.
- Otherwise, use the full world size.
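The priority chain can be sketched as a single function. This is an illustration of the order listed above, not the DeepSpeedConfig source: total_world_size stands in for dist.get_world_size(), mesh_dp_size stands in for the world size of the mesh's data-parallel group, and the two mpu classes are hypothetical stand-ins used only to exercise the branches.

```python
from typing import Any, Optional


class MegatronStyleMpu:
    """Stand-in mpu that reports its data-parallel world size directly."""

    def get_data_parallel_world_size(self) -> int:
        return 2


class UlyssesStyleMpu:
    """Stand-in mpu that only knows the sequence-parallel world size."""

    def get_sequence_parallel_world_size(self) -> int:
        return 4


def resolve_dp_world_size(
    total_world_size: int,
    mpu: Optional[Any] = None,
    mesh_dp_size: Optional[int] = None,
    config: Optional[dict] = None,
) -> int:
    """Pick the world size used for ZeRO, following the listed priority order."""
    config = config or {}
    if mpu is not None:
        if hasattr(mpu, "get_data_parallel_world_size"):
            return mpu.get_data_parallel_world_size()
        # Ulysses SP case: derive the DP size from the SP size.
        return total_world_size // mpu.get_sequence_parallel_world_size()
    if mesh_dp_size is not None:
        return mesh_dp_size
    if "sequence_parallel_size" in config:
        return total_world_size // config["sequence_parallel_size"]
    return total_world_size
```

Because the mpu branches are checked first, an mpu passed alongside a mesh device or a config key takes precedence over both.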
Related Pages
- Implementation:Deepspeedai_DeepSpeed_Initialize_For_SP
- Heuristic:Deepspeedai_DeepSpeed_Sequence_Parallel_PyTorch_Version
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
- https://arxiv.org/abs/2309.14509
Last updated: 2026-02-09 00:00 GMT