
Implementation:Deepspeedai DeepSpeed Initialize

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, Training_Orchestration, Memory_Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the DeepSpeed library, for creating a DeepSpeed distributed training engine.

Description

The deepspeed.initialize() function is the main entry point for DeepSpeed training. It takes a PyTorch model, an optional optimizer, and a configuration, and returns a 4-tuple whose first element is a DeepSpeedEngine (or a PipelineEngine for a PipelineModule, or a DeepSpeedHybridEngine when hybrid_engine.enabled=True). It handles:

  • Distributed backend initialization: Calls dist.init_distributed() with the appropriate backend (NCCL, etc.)
  • Config parsing: Creates a DeepSpeedConfig object from the provided config file or dictionary
  • Zero.Init context management: Shuts down any active zero.Init context before engine construction, then restores it afterward
  • Engine type routing:
    • PipelineModule input routes to PipelineEngine
    • hybrid_engine.enabled=True routes to DeepSpeedHybridEngine
    • Otherwise routes to DeepSpeedEngine
  • Mesh device setup: Initializes device mesh for sequence parallelism from mesh_param or config
  • AutoTP integration: Merges tensor parallelism config and sets AutoTP mode if configured
  • Optimizer flag setting: Marks parameters for specialized optimizers (e.g., Muon)
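The engine-type routing above can be sketched as a simple dispatch. The classes below are stand-in stubs, not the real DeepSpeed classes, and select_engine_type is a hypothetical helper that mirrors the three routing rules rather than DeepSpeed's actual implementation.

```python
# Stand-in stubs for the real DeepSpeed classes (illustration only).
class PipelineModule: ...
class DeepSpeedEngine: ...
class PipelineEngine(DeepSpeedEngine): ...
class DeepSpeedHybridEngine(DeepSpeedEngine): ...

def select_engine_type(model, config: dict) -> type:
    """Mirror the routing rules: pipeline model -> PipelineEngine,
    hybrid engine enabled -> DeepSpeedHybridEngine, else DeepSpeedEngine."""
    if isinstance(model, PipelineModule):
        return PipelineEngine
    if config.get("hybrid_engine", {}).get("enabled", False):
        return DeepSpeedHybridEngine
    return DeepSpeedEngine

print(select_engine_type(PipelineModule(), {}).__name__)                            # PipelineEngine
print(select_engine_type(object(), {"hybrid_engine": {"enabled": True}}).__name__)  # DeepSpeedHybridEngine
print(select_engine_type(object(), {}).__name__)                                    # DeepSpeedEngine
```

Note that the PipelineModule check takes precedence: a pipeline model is routed to PipelineEngine regardless of other config settings.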

Usage

Call this function once before the training loop. The model parameter is required; all others are optional. The returned 4-tuple provides all objects needed for the training loop.
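The shape of the resulting training loop can be sketched as follows. _StubEngine here is a hypothetical stand-in so the sketch runs without GPUs or DeepSpeed installed; in real code the engine and dataloader come from deepspeed.initialize() itself.

```python
# Hypothetical stand-in for the engine returned by deepspeed.initialize().
class _StubEngine:
    def __call__(self, batch):      # forward pass goes through the engine
        return sum(batch) * 0.001   # pretend "loss" for illustration
    def backward(self, loss):       # engine owns loss scaling / grad accumulation
        self.last_loss = loss
    def step(self):                 # engine steps the optimizer and zeroes grads
        pass

# In real code: engine, _, dataloader, _ = deepspeed.initialize(model=..., config=...)
engine = _StubEngine()
dataloader = [[1, 2, 3], [4, 5, 6]]  # stand-in for the returned DataLoader

for batch in dataloader:
    loss = engine(batch)     # forward through the engine, not the raw model
    engine.backward(loss)    # not loss.backward(): DeepSpeed manages the backward pass
    engine.step()            # not optimizer.step(): the engine handles stepping
```

The key habit the loop illustrates is calling engine.backward() and engine.step() instead of the plain PyTorch equivalents, so the engine can apply loss scaling, gradient accumulation, and ZeRO partitioning.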

Code Reference

Source Location

  • Repository: DeepSpeed
  • File: deepspeed/__init__.py
  • Lines: 80-252

Signature

def initialize(args=None,
               model: torch.nn.Module = None,
               optimizer: Optional[Union[Optimizer, DeepSpeedOptimizerCallable]] = None,
               model_parameters: Optional[torch.nn.Module] = None,
               training_data: Optional[torch.utils.data.Dataset] = None,
               lr_scheduler: Optional[Union[_LRScheduler, DeepSpeedSchedulerCallable]] = None,
               distributed_port: int = TORCH_DISTRIBUTED_DEFAULT_PORT,
               mpu=None,
               dist_init_required: Optional[bool] = None,
               collate_fn=None,
               config=None,
               mesh_param=None,
               config_params=None):

Import

import deepspeed

engine, optimizer, dataloader, lr_scheduler = deepspeed.initialize(...)

I/O Contract

Inputs

  • model (torch.nn.Module, required): The PyTorch model to wrap with the DeepSpeed engine
  • optimizer (Union[Optimizer, Callable], optional): User-defined optimizer, or a callable that returns one; overrides any optimizer in the JSON config
  • model_parameters (iterable, optional): Iterable of torch.Tensors or dicts specifying which tensors to optimize
  • training_data (torch.utils.data.Dataset, optional): Training dataset; DeepSpeed creates a DataLoader from it when provided
  • lr_scheduler (Union[_LRScheduler, Callable], optional): Learning rate scheduler object, or a callable that takes an optimizer
  • config (Union[str, dict], required unless supplied via args.deepspeed_config): DeepSpeed JSON config file path or dictionary
  • args (object, optional): Object with local_rank and deepspeed_config fields (alternative to the config parameter)
  • distributed_port (int, optional): Master node port for distributed communication (default: 29500)
  • mpu (object, optional): Model parallelism unit implementing get_{model,data}_parallel_{rank,group,world_size}()
  • dist_init_required (Optional[bool], optional): Force or skip torch.distributed initialization (None auto-detects)
  • collate_fn (Callable, optional): Custom collate function for the DataLoader
  • mesh_param (tuple, optional): Mesh parameters for device mesh initialization (data_parallel, sequence_parallel)
  • config_params (Union[str, dict], optional): Same as config; kept for backwards compatibility

Outputs

  • engine (DeepSpeedEngine): The DeepSpeed runtime engine wrapping the model for distributed training
  • optimizer (Optimizer): Wrapped optimizer (user-defined or built from the config); None if none was configured
  • training_dataloader (DataLoader): DeepSpeed DataLoader if training_data was supplied; otherwise None
  • lr_scheduler (_LRScheduler): Wrapped LR scheduler if provided or configured in the JSON; otherwise None
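Since three of the four entries may be None, unpacking is worth a defensive check. The sketch below uses a placeholder tuple with the same shape as the real return value; unpack_initialize_result is a hypothetical helper, not part of the DeepSpeed API.

```python
def unpack_initialize_result(result):
    """Unpack the 4-tuple from deepspeed.initialize(); only the engine
    is guaranteed to be non-None."""
    engine, optimizer, dataloader, lr_scheduler = result
    if engine is None:
        raise ValueError("deepspeed.initialize should always return an engine")
    return engine, optimizer, dataloader, lr_scheduler

# Placeholder standing in for a real return value: no training_data was
# passed and no scheduler was configured, so the last two entries are None.
engine, opt, dl, sched = unpack_initialize_result(("engine", "opt", None, None))
print(dl is None, sched is None)  # True True
```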

Usage Examples

import deepspeed
import torch
import torch.nn as nn

# Define a simple model
model = nn.Linear(1024, 1024)

# Initialize with a config file
engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    config="ds_config.json",
    model_parameters=model.parameters(),
)

# Initialize with a config dictionary
config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True, "initial_scale_power": 16},
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 3e-5, "betas": [0.9, 0.999]}
    }
}
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=config,
    model_parameters=model.parameters(),
)

# With a user-provided optimizer
user_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=user_optimizer,
    config="ds_config.json",
)

# The engine is now the primary interface for training
outputs = engine(input_batch)
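One detail worth checking before calling deepspeed.initialize() with a dictionary config: DeepSpeed requires that train_batch_size equal train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size. A hedged pre-flight sanity check, assuming a 2-GPU run for illustration:

```python
# Config mirrors the dictionary example above.
config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 4,
}
world_size = 2  # assumption: 2 GPUs for this sketch

# Derive the per-GPU micro batch and verify the factorization is exact;
# DeepSpeed raises at initialization time if these numbers do not agree.
micro_batch = config["train_batch_size"] // (
    config["gradient_accumulation_steps"] * world_size
)
assert (
    micro_batch * config["gradient_accumulation_steps"] * world_size
    == config["train_batch_size"]
), "train_batch_size must equal micro_batch * grad_accum_steps * world_size"
print(micro_batch)  # 4 samples per GPU per step
```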

Related Pages

Implements Principle
