Implementation:Deepspeedai DeepSpeed Initialize For SP
Overview
Concrete tool, provided by the DeepSpeed library, for initializing a DeepSpeed engine with sequence parallelism (SP) support.
Description
Calling deepspeed.initialize() with mesh_param=(dp_size, sp_size) creates a mesh device, adjusts world_size to the size of the data-parallel (DP) dimension, and passes mesh_device to DeepSpeedConfig so that batch sizes are calculated correctly. The mpu object returned by register_with_transformers() provides the SP group used for communication.
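The effect on batch-size calculation can be illustrated with a small sketch. It assumes the usual DeepSpeed relation in which the global batch size equals the per-GPU micro batch size times the gradient accumulation steps times the data-parallel world size. The helper name `effective_train_batch_size` is hypothetical and not part of DeepSpeed.

```python
def effective_train_batch_size(total_gpus, sp_size, micro_batch_per_gpu, grad_accum_steps=1):
    """Hypothetical helper: global batch size when SP ranks share the same samples."""
    if total_gpus % sp_size != 0:
        raise ValueError("total_gpus must be divisible by sp_size")
    # Ranks inside one SP group process different slices of the same sequences,
    # so only the DP dimension multiplies the number of samples per step.
    dp_size = total_gpus // sp_size
    return micro_batch_per_gpu * grad_accum_steps * dp_size


# 8 GPUs arranged as a (dp=2, sp=4) mesh, micro batch 1, no accumulation
print(effective_train_batch_size(8, 4, 1))  # 2
```

Without the adjustment, the same 8 GPUs would be counted as 8 data-parallel ranks and the computed batch size would be 4 times too large.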
When mesh_param is provided, the initialization flow is:
- Call dist.initialize_mesh_device(mesh_param, ("data_parallel", "sequence_parallel")) to create the mesh
- Pass mesh_device to DeepSpeedConfig, which extracts the data-parallel group's world size via mesh_device.get_group(mesh_dim="data_parallel")
- Construct DeepSpeedEngine with the adjusted config, mpu, and mesh_device
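To make the mesh dimensions concrete, the following sketch enumerates which global ranks would share a data-parallel or sequence-parallel group. It is pure Python and does not call DeepSpeed. It assumes a row-major rank layout (rank = dp_index * sp_size + sp_index), as used by PyTorch's `init_device_mesh`; the actual placement depends on the communication backend. The function name `mesh_groups` is hypothetical.

```python
def mesh_groups(dp_size, sp_size):
    """Return (dp_groups, sp_groups) as lists of global ranks for a (dp, sp) mesh."""
    # Assumed row-major layout: rank = dp_index * sp_size + sp_index
    mesh = [[d * sp_size + s for s in range(sp_size)] for d in range(dp_size)]
    sp_groups = [list(row) for row in mesh]        # ranks that differ along the SP axis
    dp_groups = [list(col) for col in zip(*mesh)]  # ranks that differ along the DP axis
    return dp_groups, sp_groups


dp_groups, sp_groups = mesh_groups(2, 4)
print(sp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each data-parallel group here contains dp_size = 2 ranks, which corresponds to the world size that the configuration uses for batch-size purposes.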
Alternatively, if mesh_param is not provided but sequence_parallel_size and data_parallel_size are present in the config dictionary, the mesh is created from those config values instead.
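The precedence between the two ways of specifying the mesh can be sketched as follows. This is an illustration of the behavior described above, not the library's code; `resolve_mesh_shape` is a hypothetical helper, and only the dictionary form of the config is handled.

```python
def resolve_mesh_shape(mesh_param=None, config=None):
    """Hypothetical helper: return (dp_size, sp_size), or None if no mesh is requested."""
    # An explicit mesh_param takes priority over config values.
    if mesh_param:
        return tuple(mesh_param)
    # Otherwise both keys must be present in the config dictionary.
    if isinstance(config, dict):
        if "sequence_parallel_size" in config and "data_parallel_size" in config:
            return (config["data_parallel_size"], config["sequence_parallel_size"])
    return None


print(resolve_mesh_shape(mesh_param=(2, 4)))
print(resolve_mesh_shape(config={"data_parallel_size": 2, "sequence_parallel_size": 4}))
print(resolve_mesh_shape(config={"sequence_parallel_size": 4}))
```

The first two calls yield the same (2, 4) shape; the third yields None because data_parallel_size is missing from the config.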
Code Reference
- Repository: https://github.com/deepspeedai/DeepSpeed
- File: deepspeed/__init__.py (L80-252, initialize), deepspeed/runtime/config.py (L668-687, SP config parsing)
Signature
deepspeed.initialize(
    model=model,            # nn.Module, required
    config=config,          # dict or str path, required
    mesh_param=None,        # tuple (dp_size, sp_size), optional
    mpu=None,               # mpu object from register_with_transformers, optional
    optimizer=None,         # optional user-defined optimizer
    model_parameters=None,  # optional parameter groups
    lr_scheduler=None,      # optional scheduler
    # ... other standard parameters
) -> Tuple[DeepSpeedEngine, Optimizer, DataLoader, LRScheduler]
Import
import deepspeed
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | torch.nn.Module | Yes | The model (with SP attention already registered) |
| config | dict or str | Yes | DeepSpeed configuration dictionary or path to JSON config file |
| mesh_param | tuple | No | (dp_size, sp_size) defining the mesh dimensions |
| mpu | object | No | Model parallel unit from register_with_transformers() |
| optimizer | Optimizer | No | User-defined optimizer (overrides config) |
| model_parameters | iterable | No | Parameters to optimize |
| lr_scheduler | LRScheduler | No | Learning rate scheduler |
Outputs
| Output | Type | Description |
|---|---|---|
| engine | DeepSpeedEngine | Runtime engine with correct world_size for SP, mesh_device for SP group communication |
| optimizer | Optimizer | Wrapped optimizer (or None) |
| training_dataloader | DataLoader | DeepSpeed dataloader (or None if no training_data provided) |
| lr_scheduler | LRScheduler | Wrapped scheduler (or None) |
Usage Example
import deepspeed
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

# Step 1: Register SP attention (can be done before or after model instantiation)
mpu = UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    core_attn_implementation="flash_attention_2",
    sequence_parallel_size=4,
    micro_batch_size=1,
)

# Step 2: Initialize the engine with mesh_param
# (`model` and `ds_config` are assumed to be defined earlier)
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    config=ds_config,
    mesh_param=(2, 4),  # dp_size=2, sp_size=4: 2 data-parallel replicas, each split across 4 SP ranks
    mpu=mpu,
)

# Alternative: specify the sizes in the config instead of passing mesh_param
ds_config_with_sp = {
    "data_parallel_size": 2,
    "sequence_parallel_size": 4,
    "train_micro_batch_size_per_gpu": 1,
    # ... other config
}
engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config_with_sp,
    mpu=mpu,
)
Related Pages
- Principle:Deepspeedai_DeepSpeed_SP_Engine_Init
- Heuristic:Deepspeedai_DeepSpeed_Sequence_Parallel_PyTorch_Version
Knowledge Sources
- https://github.com/deepspeedai/DeepSpeed
- https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/
Last updated: 2026-02-09 00:00 GMT