Principle: Microsoft DeepSpeedExamples ZeRO Stage 3 Initialization
Sources
- Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models -- arXiv:1910.02054
- Paper: ZeRO-Offload: Democratizing Billion-Scale Model Training -- arXiv:2101.06840
- Paper: ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning -- arXiv:2104.07857
Domains
- Distributed_Computing
- Memory_Optimization
- Inference
Overview
A memory optimization technique that partitions model parameters across GPUs with optional CPU/NVMe offloading to enable inference on models larger than available GPU memory.
Description
ZeRO Stage 3 partitions all model parameters across data-parallel ranks. For inference, this means a 175B-parameter model can run on a few GPUs (or even a single GPU with offloading) by keeping only 1/N of parameters on each GPU and gathering the rest on demand. The initialization phase configures the DeepSpeed engine with the correct partitioning strategy, offload targets, and quantization settings.
The initialization process involves five sequential steps:
- Distributed backend initialization: `deepspeed.init_distributed("nccl")` establishes the NCCL-based communication group for parameter gathering during inference.
- DeepSpeed configuration construction: A configuration dictionary is built with ZeRO Stage 3 settings, including precision (FP16 or BF16 based on the model's `torch_dtype`), prefetch buffer sizes, parameter persistence thresholds, and offload targets.
- HfDeepSpeedConfig registration: `HfDeepSpeedConfig(ds_config)` signals to HuggingFace's `from_pretrained` method that model weights should be distributed directly across devices during loading, rather than being fully materialized on each rank.
- Model loading: The appropriate HuggingFace model class loads weights (or dummy weights for benchmarking) with the DeepSpeed-aware distribution.
- DeepSpeed engine initialization: `deepspeed.initialize(model=model, config_params=ds_config)` wraps the model in a DeepSpeed engine that handles parameter gathering, offloading, and optional quantization transparently during forward passes.
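The five steps can be condensed into a short sketch, assuming a HuggingFace causal LM. `build_ds_config` and `init_zero3_inference` are illustrative helper names, not DeepSpeed API; the size values follow the ZeRO Stage 3 configuration parameters tabulated at the end of this document:

```python
def build_ds_config(hidden_size, dtype="fp16", offload_device=None):
    """Step 2: ZeRO Stage 3 inference config (DeepSpeed JSON schema)."""
    cfg = {
        "fp16": {"enabled": dtype == "fp16"},
        "bf16": {"enabled": dtype == "bf16"},
        "zero_optimization": {
            "stage": 3,
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        # Required key even though no training happens during inference.
        "train_micro_batch_size_per_gpu": 1,
    }
    if offload_device is not None:  # "cpu" or "nvme"
        cfg["zero_optimization"]["offload_param"] = {
            "device": offload_device, "pin_memory": True}
    return cfg

def init_zero3_inference(model_name, ds_config):
    """Steps 1 and 3-5; requires DeepSpeed, transformers, and GPUs."""
    import deepspeed
    from transformers import AutoModelForCausalLM
    from transformers.integrations import HfDeepSpeedConfig

    deepspeed.init_distributed("nccl")        # step 1: NCCL process group
    dschf = HfDeepSpeedConfig(ds_config)      # step 3: must outlive loading
    model = AutoModelForCausalLM.from_pretrained(model_name)  # step 4
    engine, *_ = deepspeed.initialize(        # step 5: wrap in DeepSpeed engine
        model=model, config_params=ds_config)
    return engine.module.eval()
```

Note that the `HfDeepSpeedConfig` object is instantiated before `from_pretrained` and kept referenced; it is its existence, not an explicit argument, that switches the loader into ZeRO-3-aware mode.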
Offload Strategies
The initialization supports three memory tiers:
| Strategy | Configuration | Memory Tier | Bandwidth | Use Case |
|---|---|---|---|---|
| GPU-only | No offload flags | GPU HBM | Highest | Models fitting in aggregate GPU memory |
| CPU offload | `offload_param.device = "cpu"` | Host DRAM | PCIe Gen4: ~32 GB/s | Models fitting in CPU memory |
| NVMe offload | `offload_param.device = "nvme"` | NVMe SSD | ~5.6 GB/s (sequential) | Models exceeding CPU memory |
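Tier selection reduces to comparing the per-rank parameter footprint (P/N after quantization) against each tier's capacity. A minimal sketch; the function name and decision rule are illustrative, not DeepSpeed API:

```python
def choose_offload_device(per_rank_param_bytes, gpu_bytes, cpu_bytes):
    """Pick the fastest memory tier that fits the per-rank parameter share."""
    if per_rank_param_bytes <= gpu_bytes:
        return None    # GPU-only: omit offload flags entirely
    if per_rank_param_bytes <= cpu_bytes:
        return "cpu"   # host DRAM over PCIe (~32 GB/s on Gen4)
    return "nvme"      # SSD tier (~5.6 GB/s sequential)
```

For example, a 350 GB FP16 model on a single 80 GB GPU with 512 GB of host DRAM would land on the CPU tier.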
Weight Quantization
When 4-bit quantization is enabled, the DeepSpeed configuration includes a `weight_quantization` section with `quantized_initialization` settings. The quantization is performed on the fly during `from_pretrained`, converting eligible layers (`nn.Linear`, `nn.Embedding`) to INT4 format with group-wise quantization:
```python
{
    'weight_quantization': {
        'quantized_initialization': {
            'num_bits': 4,
            'group_size': 64,
            'group_dim': 1,
            'symmetric': False
        }
    }
}
```
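In practice this section is merged into the existing DeepSpeed config dictionary before model loading. A small helper sketch (the function name is ours, the keys follow the section shown above):

```python
def with_int4_weights(ds_config, group_size=64):
    """Attach the quantized-initialization section to a DeepSpeed config dict."""
    ds_config["weight_quantization"] = {
        "quantized_initialization": {
            "num_bits": 4,             # INT4 weight storage
            "group_size": group_size,  # elements per quantization group
            "group_dim": 1,            # group along the second weight dimension
            "symmetric": False,        # asymmetric (zero-point) quantization
        }
    }
    return ds_config
```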
NVMe Configuration
NVMe offloading requires additional async I/O configuration. Buffer sizes vary by model type due to differing layer dimensions:
| Model Type | Buffer Count | Buffer Size | Notes |
|---|---|---|---|
| BLOOM (with GDS) | 3 | 8 GB | GPU Direct Storage reduces buffer needs |
| BLOOM (without GDS) | 5 | 9 GB | Standard async I/O path |
| Mixtral | 10 | 1 GB | Mixture-of-experts requires more, smaller buffers |
| Other (OPT, LLaMA) | 5 | 2 GB | Default configuration |
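The buffer table translates into the `offload_param` and `aio` sections of the DeepSpeed config. A sketch assuming the DeepSpeed JSON schema; the helper name and model-tag keys are ours, and the `aio` values are assumed defaults rather than tuned settings:

```python
GB = 1 << 30

# Buffer shapes from the table above, keyed by illustrative model tags.
NVME_BUFFERS = {
    "bloom_gds": (3, 8 * GB),   # GPU Direct Storage path
    "bloom":     (5, 9 * GB),   # standard async I/O path
    "mixtral":   (10, 1 * GB),  # MoE: more, smaller buffers
}
DEFAULT_BUFFERS = (5, 2 * GB)   # OPT, LLaMA, etc.

def nvme_offload_sections(model_type, nvme_path="/local_nvme"):
    """Build the NVMe offload and async-I/O sections of a ZeRO-3 config."""
    count, size = NVME_BUFFERS.get(model_type, DEFAULT_BUFFERS)
    return {
        "offload_param": {
            "device": "nvme",
            "nvme_path": nvme_path,   # illustrative mount point
            "pin_memory": True,
            "buffer_count": count,
            "buffer_size": size,
        },
        "aio": {
            "block_size": 1048576,    # 1 MiB I/O blocks
            "queue_depth": 8,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True,
        },
    }
```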
Theoretical Basis
Memory Partitioning
ZeRO Stage 3 partitions all model parameters P across N data-parallel ranks. Each rank stores only P/N parameters persistently. During a forward pass, parameters are gathered via all-gather operations as needed and discarded after use.
For inference, the memory per GPU is:
Memory_per_GPU = P / (N * Q) + KV_cache + activations
where:
- `P` = total parameter bytes (e.g., 175B params * 2 bytes/param = 350 GB in FP16)
- `N` = number of GPUs
- `Q` = quantization ratio (1 for FP16, 2 for 8-bit, 4 for 4-bit)
Example Memory Calculations
| Model | Params | FP16 Size | GPUs (N) | Quant (Q) | Memory per GPU (params only) |
|---|---|---|---|---|---|
| OPT-175B | 175B | 350 GB | 1 | 4 (4-bit) | ~87.5 GB (requires CPU offload) |
| OPT-175B | 175B | 350 GB | 8 | 1 (FP16) | ~43.75 GB |
| OPT-175B | 175B | 350 GB | 8 | 4 (4-bit) | ~10.9 GB |
| LLaMA-2-70B | 70B | 140 GB | 1 | 4 (4-bit) | ~35 GB |
| BLOOM-176B | 176B | 352 GB | 1 | 4 (4-bit) | ~88 GB (requires CPU offload) |
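The table values follow directly from the formula. A quick sanity check in Python (the helper name is ours; decimal gigabytes, matching the table):

```python
def param_bytes_per_gpu(n_params, n_gpus, quant_ratio, bytes_per_param=2):
    """Per-GPU parameter memory: P / (N * Q), ignoring KV cache and activations."""
    return n_params * bytes_per_param / (n_gpus * quant_ratio)

# OPT-175B on 8 GPUs with 4-bit weights: ~10.9 GB of parameters per GPU
print(param_bytes_per_gpu(175e9, 8, 4) / 1e9)
```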
HfDeepSpeedConfig Mechanism
The `HfDeepSpeedConfig` object works by patching the HuggingFace model-loading internals. When instantiated, it registers a global configuration that `from_pretrained` checks. If found, the model loading process:
- Allocates model parameters on meta device (no memory used).
- Distributes parameters across ranks according to ZeRO Stage 3 partitioning.
- Loads weights from disk/network directly into the partitioned buffers.
This avoids the memory spike of loading the full model on every rank before partitioning.
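The registration pattern can be illustrated with a toy version. The names below are ours; the real implementation in `transformers` keeps a module-level weak reference to the `HfDeepSpeedConfig` instance, which is why the object must stay alive until loading completes:

```python
# Module-level slot that the loader checks, populated on construction.
_active_config = None

class ToyHfDeepSpeedConfig:
    """Toy stand-in for HfDeepSpeedConfig: registering is a side effect
    of construction, not an explicit call."""
    def __init__(self, ds_config):
        global _active_config
        self.config = ds_config
        _active_config = self  # registered globally on construction

def zero3_loading_enabled():
    """What a from_pretrained-style loader would check before deciding to
    allocate on the meta device and partition weights during load."""
    return (_active_config is not None
            and _active_config.config.get("zero_optimization", {})
                                      .get("stage") == 3)
```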
ZeRO Stage 3 Configuration Parameters
| Parameter | Value | Description |
|---|---|---|
| `stage` | 3 | Full parameter partitioning |
| `stage3_prefetch_bucket_size` | 2 * H * H | Size of prefetch buffers for overlapping communication with computation |
| `stage3_param_persistence_threshold` | H | Parameters with fewer elements than this stay on all ranks |
| `stage3_max_live_parameters` | 2 * H * H | Maximum parameters materialized simultaneously during forward pass |

where `H` is the model's `hidden_size`.
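As a worked example, assuming OPT-175B's hidden size of 12288 (illustrative; in practice `H` is read from the loaded model's config):

```python
H = 12288                         # hidden_size of OPT-175B (assumed)
prefetch_bucket = 2 * H * H       # stage3_prefetch_bucket_size
persistence_threshold = H         # stage3_param_persistence_threshold
max_live_params = 2 * H * H       # stage3_max_live_parameters

# ~302M elements per prefetch bucket, i.e. roughly 0.6 GB in FP16.
print(prefetch_bucket, prefetch_bucket * 2 / 1e9)
```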