Workflow:Deepspeedai DeepSpeed ZeRO Distributed Training
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Training, Memory_Optimization, LLMs |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
End-to-end process for training deep learning models with DeepSpeed's ZeRO (Zero Redundancy Optimizer) across multiple GPUs, progressively partitioning optimizer states, gradients, and parameters to reduce memory footprint.
Description
This workflow covers the standard procedure for distributed training using DeepSpeed's ZeRO optimization. ZeRO partitions training state across data-parallel processes to reduce per-GPU memory usage without sacrificing computational granularity. The workflow supports three progressive stages of optimization:
- ZeRO Stage 1: Partitions optimizer states across ranks, reducing memory by up to 4x
- ZeRO Stage 2: Additionally partitions gradients, further reducing memory usage
- ZeRO Stage 3: Partitions parameters as well, enabling models larger than single-GPU memory
The process covers model definition, DeepSpeed configuration, engine initialization, the distributed training loop, and checkpoint management. It also supports optional CPU offloading for optimizer states and parameters (ZeRO-Offload and ZeRO-Infinity) to enable training even larger models on limited GPU hardware.
Usage
Execute this workflow when you need to train a deep learning model across one or more GPUs and want to reduce memory footprint through optimized state partitioning. This is the primary training workflow for any DeepSpeed user. Use ZeRO Stage 1-2 for moderate memory savings with minimal communication overhead, or Stage 3 when the model cannot fit in GPU memory even with basic partitioning. Add CPU offloading when GPU memory is severely constrained and you are willing to trade some throughput for the ability to train larger models.
Execution Steps
Step 1: Environment Setup
Set up the distributed training environment including installing DeepSpeed, ensuring PyTorch and CUDA/ROCm compilers are available, and verifying hardware compatibility. Run the DeepSpeed environment report to confirm the system is properly configured and all required ops can be compiled.
Key considerations:
- PyTorch must be installed before DeepSpeed
- A CUDA or ROCm compiler (nvcc or hipcc) is needed for C++/CUDA extension compilation
- Extensions are JIT-compiled by default; pre-compilation is optional via DS_BUILD_* environment variables
Step 2: DeepSpeed Configuration
Create a JSON configuration file that specifies the training hyperparameters, optimizer settings, precision mode (FP16/BF16), and ZeRO optimization stage. The configuration drives all DeepSpeed behavior and determines which engine features are activated.
Key considerations:
- Choose ZeRO stage based on model size and available GPU memory
- Enable CPU offloading (offload_optimizer, offload_param) if GPU memory is insufficient
- Set train_batch_size, micro_batch_per_gpu, and gradient_accumulation_steps consistently
- FP16 requires initial_scale_power configuration; BF16 can be enabled with a simple flag
Step 3: Model Definition
Define or load the PyTorch model. For ZeRO Stage 3, optionally wrap model construction inside the deepspeed.zero.Init() context manager so parameters are immediately partitioned across ranks during initialization, avoiding the memory spike of constructing the full model on each GPU.
Key considerations:
- Standard PyTorch nn.Module models work directly with DeepSpeed
- For ZeRO Stage 3 with very large models, use deepspeed.zero.Init() to partition during construction
- HuggingFace models are fully compatible
Step 4: Engine Initialization
Call deepspeed.initialize() with the model, configuration, and optional optimizer/scheduler. This returns the DeepSpeed engine (which wraps the model), the optimizer, an optional data loader, and the learning rate scheduler. The engine handles all distributed communication, gradient management, and ZeRO partitioning automatically.
Key considerations:
- Pass model_parameters to specify which parameters to optimize
- DeepSpeed can create the optimizer internally from config, or accept a user-supplied one
- Distributed communication is auto-initialized if not already done
- The returned engine replaces the raw model for all training operations
Step 5: Training Loop Execution
Execute the training loop using the DeepSpeed engine's backward() and step() methods instead of the standard PyTorch equivalents. The engine handles gradient scaling (for FP16), gradient accumulation, allreduce communication, and optimizer stepping automatically based on the configuration.
Key considerations:
- Use engine.backward(loss) instead of loss.backward()
- Use engine.step() instead of optimizer.step()
- The engine manages gradient accumulation boundaries automatically
- Use engine.train_micro_batch_size_per_gpu() for correct batch sizing
Step 6: Checkpoint Management
Save and load checkpoints using the DeepSpeed engine's save_checkpoint() and load_checkpoint() methods. These handle the distributed nature of ZeRO-partitioned states, ensuring each rank saves its own shard and all shards are properly reconstructed on load.
Key considerations:
- Each rank saves its own partition of optimizer states and parameters
- Checkpoints include model weights, optimizer states, scheduler state, and training metadata
- Use the zero_to_fp32 utility to convert ZeRO Stage 3 checkpoints to standard PyTorch format
- Universal Checkpointing allows resuming with different parallelism configurations