Workflow:Deepspeedai DeepSpeed ZeRO Distributed Training

Knowledge Sources	DeepSpeed DeepSpeed Documentation ZeRO Tutorial ZeRO: Memory Optimizations
Domains	Distributed_Training, Memory_Optimization, LLMs
Last Updated	2026-02-09 00:00 GMT

Overview

End-to-end process for training deep learning models with DeepSpeed's ZeRO (Zero Redundancy Optimizer) across multiple GPUs, progressively partitioning optimizer states, gradients, and parameters to reduce memory footprint.

Description

This workflow covers the standard procedure for distributed training using DeepSpeed's ZeRO optimization. ZeRO partitions training state across data-parallel processes to reduce per-GPU memory usage without sacrificing computational granularity. The workflow supports three progressive stages of optimization:

ZeRO Stage 1: Partitions optimizer states across ranks, reducing memory by up to 4x
ZeRO Stage 2: Additionally partitions gradients, further reducing memory usage
ZeRO Stage 3: Partitions parameters as well, enabling models larger than single-GPU memory

The process covers model definition, DeepSpeed configuration, engine initialization, the distributed training loop, and checkpoint management. It also supports optional CPU offloading for optimizer states and parameters (ZeRO-Offload and ZeRO-Infinity) to enable training even larger models on limited GPU hardware.

Usage

Execute this workflow when you need to train a deep learning model across one or more GPUs and want to reduce memory footprint through optimized state partitioning. This is the primary training workflow for any DeepSpeed user. Use ZeRO Stage 1-2 for moderate memory savings with minimal communication overhead, or Stage 3 when the model cannot fit in GPU memory even with basic partitioning. Add CPU offloading when GPU memory is severely constrained and you are willing to trade some throughput for the ability to train larger models.

Execution Steps

Step 1: Environment Setup

Set up the distributed training environment including installing DeepSpeed, ensuring PyTorch and CUDA/ROCm compilers are available, and verifying hardware compatibility. Run the DeepSpeed environment report to confirm the system is properly configured and all required ops can be compiled.

Key considerations:

PyTorch must be installed before DeepSpeed
A CUDA or ROCm compiler (nvcc or hipcc) is needed for C++/CUDA extension compilation
Extensions are JIT-compiled by default; pre-compilation is optional via DS_BUILD_* environment variables

Step 2: DeepSpeed Configuration

Create a JSON configuration file that specifies the training hyperparameters, optimizer settings, precision mode (FP16/BF16), and ZeRO optimization stage. The configuration drives all DeepSpeed behavior and determines which engine features are activated.

Key considerations:

Choose ZeRO stage based on model size and available GPU memory
Enable CPU offloading (offload_optimizer, offload_param) if GPU memory is insufficient
Set train_batch_size, micro_batch_per_gpu, and gradient_accumulation_steps consistently
FP16 requires initial_scale_power configuration; BF16 can be enabled with a simple flag

Step 3: Model Definition

Define or load the PyTorch model. For ZeRO Stage 3, optionally wrap model construction inside the deepspeed.zero.Init() context manager so parameters are immediately partitioned across ranks during initialization, avoiding the memory spike of constructing the full model on each GPU.

Key considerations:

Standard PyTorch nn.Module models work directly with DeepSpeed
For ZeRO Stage 3 with very large models, use deepspeed.zero.Init() to partition during construction
HuggingFace models are fully compatible

Step 4: Engine Initialization

Call deepspeed.initialize() with the model, configuration, and optional optimizer/scheduler. This returns the DeepSpeed engine (which wraps the model), the optimizer, an optional data loader, and the learning rate scheduler. The engine handles all distributed communication, gradient management, and ZeRO partitioning automatically.

Key considerations:

Pass model_parameters to specify which parameters to optimize
DeepSpeed can create the optimizer internally from config, or accept a user-supplied one
Distributed communication is auto-initialized if not already done
The returned engine replaces the raw model for all training operations

Step 5: Training Loop Execution

Execute the training loop using the DeepSpeed engine's backward() and step() methods instead of the standard PyTorch equivalents. The engine handles gradient scaling (for FP16), gradient accumulation, allreduce communication, and optimizer stepping automatically based on the configuration.

Key considerations:

Use engine.backward(loss) instead of loss.backward()
Use engine.step() instead of optimizer.step()
The engine manages gradient accumulation boundaries automatically
Use engine.train_micro_batch_size_per_gpu() for correct batch sizing

Step 6: Checkpoint Management

Save and load checkpoints using the DeepSpeed engine's save_checkpoint() and load_checkpoint() methods. These handle the distributed nature of ZeRO-partitioned states, ensuring each rank saves its own shard and all shards are properly reconstructed on load.

Key considerations:

Each rank saves its own partition of optimizer states and parameters
Checkpoints include model weights, optimizer states, scheduler state, and training metadata
Use the zero_to_fp32 utility to convert ZeRO Stage 3 checkpoints to standard PyTorch format
Universal Checkpointing allows resuming with different parallelism configurations

Execution Diagram

GitHub URL

Workflow Repository