Principle: Microsoft DeepSpeedExamples ZeRO Stage 3 Initialization
Sources
- Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models -- arXiv:1910.02054
- Paper: ZeRO-Offload: Democratizing Billion-Scale Model Training -- arXiv:2101.06840
- Paper: ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning -- arXiv:2104.07857
Domains
- Distributed_Computing
- Memory_Optimization
- Inference
Overview
A memory optimization technique that partitions model parameters across GPUs with optional CPU/NVMe offloading to enable inference on models larger than available GPU memory.
Description
ZeRO Stage 3 partitions all model parameters across data-parallel ranks. For inference, this means a 175B-parameter model can run on a few GPUs (or even a single GPU with offloading) by keeping only 1/N of parameters on each GPU and gathering the rest on demand. The initialization phase configures the DeepSpeed engine with the correct partitioning strategy, offload targets, and quantization settings.
The initialization process involves five sequential steps:
- Distributed backend initialization: `deepspeed.init_distributed("nccl")` establishes the NCCL-based communication group for parameter gathering during inference.
- DeepSpeed configuration construction: A configuration dictionary is built with ZeRO Stage 3 settings, including precision (FP16 or BF16 based on the model's `torch_dtype`), prefetch buffer sizes, parameter persistence thresholds, and offload targets.
- HfDeepSpeedConfig registration: `HfDeepSpeedConfig(ds_config)` signals to HuggingFace's `from_pretrained` method that model weights should be distributed directly across devices during loading, rather than being fully materialized on each rank.
- Model loading: The appropriate HuggingFace model class loads weights (or dummy weights for benchmarking) with the DeepSpeed-aware distribution.
- DeepSpeed engine initialization: `deepspeed.initialize(model=model, config_params=ds_config)` wraps the model in a DeepSpeed engine that handles parameter gathering, offloading, and optional quantization transparently during forward passes.
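The five steps can be condensed into a short sketch, assuming a HuggingFace causal LM. `build_ds_config` and `init_zero3_inference` are illustrative helper names, not DeepSpeed API; the size values follow the ZeRO Stage 3 configuration parameters tabulated at the end of this document:

```python
def build_ds_config(hidden_size, dtype="fp16", offload_device=None):
    """Step 2: ZeRO Stage 3 inference config (DeepSpeed JSON schema)."""
    cfg = {
        "fp16": {"enabled": dtype == "fp16"},
        "bf16": {"enabled": dtype == "bf16"},
        "zero_optimization": {
            "stage": 3,
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        # Required key even though no training happens during inference.
        "train_micro_batch_size_per_gpu": 1,
    }
    if offload_device is not None:  # "cpu" or "nvme"
        cfg["zero_optimization"]["offload_param"] = {
            "device": offload_device, "pin_memory": True}
    return cfg

def init_zero3_inference(model_name, ds_config):
    """Steps 1 and 3-5; requires DeepSpeed, transformers, and GPUs."""
    import deepspeed
    from transformers import AutoModelForCausalLM
    from transformers.integrations import HfDeepSpeedConfig

    deepspeed.init_distributed("nccl")        # step 1: NCCL process group
    dschf = HfDeepSpeedConfig(ds_config)      # step 3: must outlive loading
    model = AutoModelForCausalLM.from_pretrained(model_name)  # step 4
    engine, *_ = deepspeed.initialize(        # step 5: wrap in DeepSpeed engine
        model=model, config_params=ds_config)
    return engine.module.eval()
```

Note that the `HfDeepSpeedConfig` object is instantiated before `from_pretrained` and kept referenced; it is its existence, not an explicit argument, that switches the loader into ZeRO-3-aware mode.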
Offload Strategies
The initialization supports three memory tiers:
| Strategy | Configuration | Memory Tier | Bandwidth | Use Case |
|---|---|---|---|---|
| GPU-only | No offload flags | GPU HBM | Highest | Models fitting in aggregate GPU memory |
| CPU offload | `offload_param.device = "cpu"` | Host DRAM | PCIe Gen4: ~32 GB/s | Models fitting in CPU memory |
| NVMe offload | `offload_param.device = "nvme"` | NVMe SSD | ~5.6 GB/s (sequential) | Models exceeding CPU memory |
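Tier selection reduces to comparing the per-rank parameter footprint (P/N after quantization) against each tier's capacity. A minimal sketch; the function name and decision rule are illustrative, not DeepSpeed API:

```python
def choose_offload_device(per_rank_param_bytes, gpu_bytes, cpu_bytes):
    """Pick the fastest memory tier that fits the per-rank parameter share."""
    if per_rank_param_bytes <= gpu_bytes:
        return None    # GPU-only: omit offload flags entirely
    if per_rank_param_bytes <= cpu_bytes:
        return "cpu"   # host DRAM over PCIe (~32 GB/s on Gen4)
    return "nvme"      # SSD tier (~5.6 GB/s sequential)
```

For example, a 350 GB FP16 model on a single 80 GB GPU with 512 GB of host DRAM would land on the CPU tier.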
Weight Quantization
When 4-bit quantization is enabled, the DeepSpeed configuration includes a `weight_quantization` section with `quantized_initialization` settings. The quantization is performed on the fly during `from_pretrained`, converting eligible layers (`nn.Linear`, `nn.Embedding`) to INT4 format with group-wise quantization:
```python
{
    'weight_quantization': {
        'quantized_initialization': {
            'num_bits': 4,
            'group_size': 64,
            'group_dim': 1,
            'symmetric': False
        }
    }
}
```
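In practice this section is merged into the existing DeepSpeed config dictionary before model loading. A small helper sketch (the function name is ours, the keys follow the section shown above):

```python
def with_int4_weights(ds_config, group_size=64):
    """Attach the quantized-initialization section to a DeepSpeed config dict."""
    ds_config["weight_quantization"] = {
        "quantized_initialization": {
            "num_bits": 4,             # INT4 weight storage
            "group_size": group_size,  # elements per quantization group
            "group_dim": 1,            # group along the second weight dimension
            "symmetric": False,        # asymmetric (zero-point) quantization
        }
    }
    return ds_config
```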
NVMe Configuration
NVMe offloading requires additional async I/O configuration. Buffer sizes vary by model type due to differing layer dimensions:
| Model Type | Buffer Count | Buffer Size | Notes |
|---|---|---|---|
| BLOOM (with GDS) | 3 | 8 GB | GPU Direct Storage reduces buffer needs |
| BLOOM (without GDS) | 5 | 9 GB | Standard async I/O path |
| Mixtral | 10 | 1 GB | Mixture-of-experts requires more, smaller buffers |
| Other (OPT, LLaMA) | 5 | 2 GB | Default configuration |
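The buffer table translates into the `offload_param` and `aio` sections of the DeepSpeed config. A sketch assuming the DeepSpeed JSON schema; the helper name and model-tag keys are ours, and the `aio` values are assumed defaults rather than tuned settings:

```python
GB = 1 << 30

# Buffer shapes from the table above, keyed by illustrative model tags.
NVME_BUFFERS = {
    "bloom_gds": (3, 8 * GB),   # GPU Direct Storage path
    "bloom":     (5, 9 * GB),   # standard async I/O path
    "mixtral":   (10, 1 * GB),  # MoE: more, smaller buffers
}
DEFAULT_BUFFERS = (5, 2 * GB)   # OPT, LLaMA, etc.

def nvme_offload_sections(model_type, nvme_path="/local_nvme"):
    """Build the NVMe offload and async-I/O sections of a ZeRO-3 config."""
    count, size = NVME_BUFFERS.get(model_type, DEFAULT_BUFFERS)
    return {
        "offload_param": {
            "device": "nvme",
            "nvme_path": nvme_path,   # illustrative mount point
            "pin_memory": True,
            "buffer_count": count,
            "buffer_size": size,
        },
        "aio": {
            "block_size": 1048576,    # 1 MiB I/O blocks
            "queue_depth": 8,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True,
        },
    }
```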
Theoretical Basis
Memory Partitioning
ZeRO Stage 3 partitions all model parameters P across N data-parallel ranks. Each rank stores only P/N parameters persistently. During a forward pass, parameters are gathered via all-gather operations as needed and discarded after use.
For inference, the memory per GPU is:
Memory_per_GPU = P / (N * Q) + KV_cache + activations
where:
- `P` = total parameter bytes (e.g., 175B params * 2 bytes/param = 350 GB in FP16)
- `N` = number of GPUs
- `Q` = quantization ratio (1 for FP16, 2 for 8-bit, 4 for 4-bit)
Example Memory Calculations
| Model | Params | FP16 Size | GPUs (N) | Quant (Q) | Memory per GPU (params only) |
|---|---|---|---|---|---|
| OPT-175B | 175B | 350 GB | 1 | 4 (4-bit) | ~87.5 GB (requires CPU offload) |
| OPT-175B | 175B | 350 GB | 8 | 1 (FP16) | ~43.75 GB |
| OPT-175B | 175B | 350 GB | 8 | 4 (4-bit) | ~10.9 GB |
| LLaMA-2-70B | 70B | 140 GB | 1 | 4 (4-bit) | ~35 GB |
| BLOOM-176B | 176B | 352 GB | 1 | 4 (4-bit) | ~88 GB (requires CPU offload) |
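The table values follow directly from the formula. A quick sanity check in Python (the helper name is ours; decimal gigabytes, matching the table):

```python
def param_bytes_per_gpu(n_params, n_gpus, quant_ratio, bytes_per_param=2):
    """Per-GPU parameter memory: P / (N * Q), ignoring KV cache and activations."""
    return n_params * bytes_per_param / (n_gpus * quant_ratio)

# OPT-175B on 8 GPUs with 4-bit weights: ~10.9 GB of parameters per GPU
print(param_bytes_per_gpu(175e9, 8, 4) / 1e9)
```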
HfDeepSpeedConfig Mechanism
The `HfDeepSpeedConfig` object works by patching the HuggingFace model-loading internals. When instantiated, it registers a global configuration that `from_pretrained` checks. If found, the model loading process:
- Allocates model parameters on meta device (no memory used).
- Distributes parameters across ranks according to ZeRO Stage 3 partitioning.
- Loads weights from disk/network directly into the partitioned buffers.
This avoids the memory spike of loading the full model on every rank before partitioning.
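The registration pattern can be illustrated with a toy version. The names below are ours; the real implementation in `transformers` keeps a module-level weak reference to the `HfDeepSpeedConfig` instance, which is why the object must stay alive until loading completes:

```python
# Module-level slot that the loader checks, populated on construction.
_active_config = None

class ToyHfDeepSpeedConfig:
    """Toy stand-in for HfDeepSpeedConfig: registering is a side effect
    of construction, not an explicit call."""
    def __init__(self, ds_config):
        global _active_config
        self.config = ds_config
        _active_config = self  # registered globally on construction

def zero3_loading_enabled():
    """What a from_pretrained-style loader would check before deciding to
    allocate on the meta device and partition weights during load."""
    return (_active_config is not None
            and _active_config.config.get("zero_optimization", {})
                                      .get("stage") == 3)
```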
ZeRO Stage 3 Configuration Parameters
| Parameter | Value | Description |
|---|---|---|
| `stage` | 3 | Full parameter partitioning |
| `stage3_prefetch_bucket_size` | 2 * H * H | Size of prefetch buffers for overlapping communication with computation |
| `stage3_param_persistence_threshold` | H | Parameters with fewer elements than this stay on all ranks |
| `stage3_max_live_parameters` | 2 * H * H | Maximum parameters materialized simultaneously during forward pass |

where `H` is the model's `hidden_size`.
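As a worked example, assuming OPT-175B's hidden size of 12288 (illustrative; in practice `H` is read from the loaded model's config):

```python
H = 12288                         # hidden_size of OPT-175B (assumed)
prefetch_bucket = 2 * H * H       # stage3_prefetch_bucket_size
persistence_threshold = H         # stage3_param_persistence_threshold
max_live_params = 2 * H * H       # stage3_max_live_parameters

# ~302M elements per prefetch bucket, i.e. roughly 0.6 GB in FP16.
print(prefetch_bucket, prefetch_bucket * 2 / 1e9)
```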