Implementation:Deepspeedai DeepSpeed Initialize For SP

Overview

A concrete tool, provided by the DeepSpeed library, for initializing a DeepSpeed engine with sequence parallelism (SP) support.

Description

Calling deepspeed.initialize() with mesh_param=(dp_size, sp_size) creates a device mesh, restricts the world_size used for batch size calculation to the data-parallel (DP) dimension, and passes the resulting mesh_device to DeepSpeedConfig. The mpu object returned by register_with_transformers() provides the SP process group used for communication.
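To illustrate why the DP world size matters for batch size, here is a minimal sketch of the arithmetic. It assumes the standard DeepSpeed relation train_batch_size = micro batch per GPU x gradient accumulation steps x DP world size, and that the DP world size equals the total number of ranks divided by sp_size. The function name is hypothetical and is not part of DeepSpeed.

```python
def expected_train_batch_size(world_size, sp_size, micro_batch_per_gpu, grad_accum_steps):
    """Hypothetical helper: effective global batch size when SP is enabled.

    Ranks in the same SP group share one sequence, so only the
    data-parallel dimension multiplies the batch size.
    """
    if sp_size <= 0 or world_size % sp_size != 0:
        raise ValueError("world_size must be a positive multiple of sp_size")
    dp_world_size = world_size // sp_size  # ranks that see distinct samples
    return micro_batch_per_gpu * grad_accum_steps * dp_world_size


# 8 GPUs, sp_size=4 -> dp_world_size=2
print(expected_train_batch_size(world_size=8, sp_size=4,
                                micro_batch_per_gpu=1, grad_accum_steps=4))
```

Without the adjustment (treating all 8 ranks as data parallel), the same settings would imply a batch size four times larger than what the run actually processes.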

When mesh_param is provided, the initialization flow is:

1. Call dist.initialize_mesh_device(mesh_param, ("data_parallel", "sequence_parallel")) to create the mesh.
2. Pass mesh_device to DeepSpeedConfig, which extracts the data-parallel group's world size via mesh_device.get_group(mesh_dim="data_parallel").
3. Construct DeepSpeedEngine with the adjusted config, mpu, and mesh_device.
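The rank layout implied by a (dp_size, sp_size) mesh can be sketched in plain Python. This is an illustration only: it assumes the mesh is laid out row-major over global ranks (the default behavior of PyTorch's device mesh, which this page does not confirm DeepSpeed relies on), and the helper names are hypothetical.

```python
def mesh_coords(rank, dp_size, sp_size):
    """Return (dp_index, sp_index) for a global rank, assuming row-major layout."""
    if not 0 <= rank < dp_size * sp_size:
        raise ValueError("rank is outside the mesh")
    return divmod(rank, sp_size)


def sp_group_ranks(rank, dp_size, sp_size):
    """Ranks sharing this rank's sequence (same row of the mesh)."""
    dp_idx, _ = mesh_coords(rank, dp_size, sp_size)
    return [dp_idx * sp_size + j for j in range(sp_size)]


def dp_group_ranks(rank, dp_size, sp_size):
    """Ranks holding the same sequence shard position (same column of the mesh)."""
    _, sp_idx = mesh_coords(rank, dp_size, sp_size)
    return [i * sp_size + sp_idx for i in range(dp_size)]


# Mesh (2, 4): rank 5 sits in row 1, column 1.
print(mesh_coords(5, 2, 4))
print(sp_group_ranks(5, 2, 4))
print(dp_group_ranks(5, 2, 4))
```

Under this assumed layout, the size of each data-parallel group equals dp_size, which is the value the config needs for batch size calculation.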

Alternatively, if mesh_param is not provided but sequence_parallel_size and data_parallel_size are present in the config dictionary, the mesh is created from those config values instead.
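The precedence between the two ways of specifying the mesh can be sketched as follows. This is not DeepSpeed's code: resolve_mesh_shape is a hypothetical helper that mirrors the rule described above, and it handles only the case where the config is a dictionary (not a path to a JSON file).

```python
def resolve_mesh_shape(mesh_param=None, config=None):
    """Hypothetical helper: return (dp_size, sp_size), or None if no mesh is requested."""
    # An explicit mesh_param argument is used when it is provided.
    if mesh_param is not None:
        return tuple(mesh_param)
    # Otherwise fall back to the config, but only if both sizes are present.
    config = config or {}
    if "sequence_parallel_size" in config and "data_parallel_size" in config:
        return (config["data_parallel_size"], config["sequence_parallel_size"])
    return None


print(resolve_mesh_shape(mesh_param=(2, 4)))
print(resolve_mesh_shape(config={"data_parallel_size": 2, "sequence_parallel_size": 4}))
print(resolve_mesh_shape(config={"sequence_parallel_size": 4}))
```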

Code Reference

Repository: https://github.com/deepspeedai/DeepSpeed
File: deepspeed/__init__.py (L80-252, initialize), deepspeed/runtime/config.py (L668-687, SP config parsing)

Signature

deepspeed.initialize(
    model=model,          # nn.Module, required
    config=config,        # dict or str path, required
    mesh_param=None,      # tuple (dp_size, sp_size), optional
    mpu=None,             # mpu object from register_with_transformers, optional
    optimizer=None,       # optional user-defined optimizer
    model_parameters=None,  # optional parameter groups
    lr_scheduler=None,    # optional scheduler
    # ... other standard parameters
) -> Tuple[DeepSpeedEngine, Optimizer, DataLoader, LRScheduler]

Import

import deepspeed

I/O Contract

Inputs

Parameter	Type	Required	Description
model	torch.nn.Module	Yes	The model (with SP attention already registered)
config	dict or str	Yes	DeepSpeed configuration dictionary or path to JSON config file
mesh_param	tuple	No	`(dp_size, sp_size)` defining the mesh dimensions
mpu	object	No	Model parallel unit from `register_with_transformers()`
optimizer	Optimizer	No	User-defined optimizer (overrides config)
model_parameters	iterable	No	Parameters to optimize
lr_scheduler	LRScheduler	No	Learning rate scheduler

Outputs

Output	Type	Description
engine	DeepSpeedEngine	Runtime engine with correct world_size for SP, mesh_device for SP group communication
optimizer	Optimizer	Wrapped optimizer (or None)
training_dataloader	DataLoader	DeepSpeed dataloader (or None if no training_data provided)
lr_scheduler	LRScheduler	Wrapped scheduler (or None)

Usage Example

import deepspeed
from deepspeed.runtime.sequence_parallel.ulysses_sp import UlyssesSPAttentionHF

# Step 1: Register SP attention with transformers
# (this can be done before or after the model is instantiated)
mpu = UlyssesSPAttentionHF.register_with_transformers(
    model_name_or_path="meta-llama/Llama-2-7b-hf",
    core_attn_implementation="flash_attention_2",
    sequence_parallel_size=4,
    micro_batch_size=1,
)

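# Step 2: Initialize the engine with mesh_param.
# `model` (the instantiated nn.Module) and `ds_config` (dict or JSON path)
# are assumed to be defined elsewhere.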
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    config=ds_config,
    mesh_param=(2, 4),  # 2 data-parallel replicas, each spanning 4 SP ranks (8 ranks total)
    mpu=mpu,
)

# Alternative: specify in config instead of mesh_param
ds_config_with_sp = {
    "data_parallel_size": 2,
    "sequence_parallel_size": 4,
    "train_micro_batch_size_per_gpu": 1,
    # ... other config
}
engine, _, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config_with_sp,
    mpu=mpu,
)

Related Pages

Knowledge Sources

Last updated: 2026-02-09 00:00 GMT
