Principle:Microsoft DeepSpeedExamples ZeRO Stage3 Initialization

From Leeroopedia


Sources

  • Paper: ZeRO: Memory Optimizations Toward Training Trillion Parameter Models -- arXiv:1910.02054
  • Paper: ZeRO-Offload: Democratizing Billion-Scale Model Training -- arXiv:2101.06840
  • Paper: ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning -- arXiv:2104.07857

Domains

  • Distributed_Computing
  • Memory_Optimization
  • Inference

Overview

A memory optimization technique that partitions model parameters across GPUs with optional CPU/NVMe offloading to enable inference on models larger than available GPU memory.

Description

ZeRO Stage 3 partitions all model parameters across data-parallel ranks. For inference, this means a 175B-parameter model can run on a few GPUs (or even a single GPU with offloading) by keeping only 1/N of parameters on each GPU and gathering the rest on demand. The initialization phase configures the DeepSpeed engine with the correct partitioning strategy, offload targets, and quantization settings.

The initialization process involves five sequential steps:

  1. Distributed backend initialization: deepspeed.init_distributed("nccl") establishes the NCCL-based communication group for parameter gathering during inference.
  2. DeepSpeed configuration construction: A configuration dictionary is built with ZeRO Stage 3 settings, including precision (FP16 or BF16 based on the model's torch_dtype), prefetch buffer sizes, parameter persistence thresholds, and offload targets.
  3. HfDeepSpeedConfig registration: HfDeepSpeedConfig(ds_config) signals to HuggingFace's from_pretrained method that model weights should be distributed directly across devices during loading, rather than being fully materialized on each rank.
  4. Model loading: The appropriate HuggingFace model class loads weights (or dummy weights for benchmarking) with the DeepSpeed-aware distribution.
  5. DeepSpeed engine initialization: deepspeed.initialize(model=model, config_params=ds_config) wraps the model in a DeepSpeed engine that handles parameter gathering, offloading, and optional quantization transparently during forward passes.
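The configuration dictionary assembled in step 2 can be sketched in plain Python. The helper below is illustrative (build_ds_config is not a DeepSpeed API); the keys follow the zero_optimization schema, with the hidden-size-derived heuristics described later on this page:

```python
def build_ds_config(hidden_size, dtype="fp16", offload_device=None):
    """Illustrative helper (not a DeepSpeed API): assemble the ZeRO Stage 3
    inference configuration described in step 2."""
    config = {
        # Precision follows the model's torch_dtype (step 2).
        "fp16": {"enabled": dtype == "fp16"},
        "bf16": {"enabled": dtype == "bf16"},
        "zero_optimization": {
            "stage": 3,  # full parameter partitioning
            # Heuristics derived from hidden_size H:
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        "train_micro_batch_size_per_gpu": 1,
    }
    if offload_device is not None:  # "cpu" or "nvme"
        config["zero_optimization"]["offload_param"] = {
            "device": offload_device,
            "pin_memory": True,
        }
    return config
```

The resulting dictionary is then registered via HfDeepSpeedConfig(ds_config) before from_pretrained (steps 3-4) and passed to deepspeed.initialize(model=model, config_params=ds_config) (step 5).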

Offload Strategies

The initialization supports three memory tiers:

| Strategy | Configuration | Memory Tier | Bandwidth | Use Case |
|---|---|---|---|---|
| GPU-only | No offload flags | GPU HBM | Highest | Models fitting in aggregate GPU memory |
| CPU offload | offload_param.device = "cpu" | Host DRAM | PCIe Gen4: ~32 GB/s | Models fitting in CPU memory |
| NVMe offload | offload_param.device = "nvme" | NVMe SSD | ~5.6 GB/s (sequential) | Models exceeding CPU memory |
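As a sketch, the three tiers correspond to the following zero_optimization fragments (the keys come from DeepSpeed's offload_param schema; the nvme_path value here is a placeholder mount point):

```python
# GPU-only: no offload_param section at all.
GPU_ONLY = {"stage": 3}

# CPU offload: parameters live in pinned host DRAM, fetched over PCIe.
CPU_OFFLOAD = {
    "stage": 3,
    "offload_param": {"device": "cpu", "pin_memory": True},
}

# NVMe offload: parameters live on SSD, staged through pinned host buffers.
NVME_OFFLOAD = {
    "stage": 3,
    "offload_param": {
        "device": "nvme",
        "nvme_path": "/local_nvme",  # placeholder path to a fast local SSD
        "pin_memory": True,
    },
}
```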

Weight Quantization

When 4-bit quantization is enabled, the DeepSpeed configuration includes a weight_quantization section with quantized_initialization settings. The quantization is performed on-the-fly during from_pretrained, converting eligible layers (nn.Linear, nn.Embedding) to INT4 format with group-wise quantization:

{
    "weight_quantization": {
        "quantized_initialization": {
            "num_bits": 4,
            "group_size": 64,
            "group_dim": 1,
            "symmetric": false
        }
    }
}
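A back-of-envelope check of the resulting footprint. This assumes, purely as an illustration, one FP16 scale and one FP16 zero-point stored per group of 64 weights; the actual metadata layout is implementation-specific:

```python
def int4_groupwise_bytes(num_elements, group_size=64, meta_bytes_per_group=4):
    """Approximate storage for group-wise asymmetric INT4 weights
    (assumed layout: packed 4-bit values plus per-group FP16 scale/zero)."""
    packed = num_elements // 2            # two 4-bit values per byte
    groups = num_elements // group_size   # one (scale, zero-point) pair each
    return packed + groups * meta_bytes_per_group

P = 175_000_000_000
fp16_bytes = 2 * P                        # 350 GB in FP16
int4_bytes = int4_groupwise_bytes(P)
ratio = fp16_bytes / int4_bytes           # ~3.6x, slightly below the nominal 4x
```

The per-group metadata is why the effective compression falls a little short of the nominal 4x used for Q in the memory formula below.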

NVMe Configuration

NVMe offloading requires additional async I/O configuration. Buffer sizes vary by model type due to differing layer dimensions:

| Model Type | Buffer Count | Buffer Size | Notes |
|---|---|---|---|
| BLOOM (with GDS) | 3 | 8 GB | GPU Direct Storage reduces buffer needs |
| BLOOM (without GDS) | 5 | 9 GB | Standard async I/O path |
| Mixtral | 10 | 1 GB | Mixture-of-experts uses more, smaller buffers |
| Other (OPT, LLaMA) | 5 | 2 GB | Default configuration |
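The per-model selection above can be sketched as a small helper (the function is hypothetical; buffer_count and buffer_size are keys in DeepSpeed's offload_param section):

```python
GB = 1 << 30  # binary gigabyte

def nvme_buffer_config(model_type, use_gds=False):
    """Pick NVMe offload buffer geometry per model family (illustrative)."""
    if model_type == "bloom":
        if use_gds:  # GPU Direct Storage path needs fewer buffers
            return {"buffer_count": 3, "buffer_size": 8 * GB}
        return {"buffer_count": 5, "buffer_size": 9 * GB}
    if model_type == "mixtral":  # MoE layers favor more, smaller buffers
        return {"buffer_count": 10, "buffer_size": 1 * GB}
    # Default for dense decoder models such as OPT and LLaMA
    return {"buffer_count": 5, "buffer_size": 2 * GB}
```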

Theoretical Basis

Memory Partitioning

ZeRO Stage 3 partitions all model parameters P across N data-parallel ranks. Each rank stores only P/N parameters persistently. During a forward pass, parameters are gathered via all-gather operations as needed and discarded after use.

For inference, the memory per GPU is:

Memory_per_GPU = P / (N * Q) + KV_cache + activations

where:

  • P = total parameter bytes (e.g., 175B params * 2 bytes/param = 350 GB in FP16)
  • N = number of GPUs
  • Q = quantization ratio (1 for FP16, 2 for 8-bit, 4 for 4-bit)
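The formula can be evaluated directly. This computes only the parameters term P / (N * Q); the KV cache and activations come on top:

```python
def param_bytes_per_gpu(total_params, num_gpus=1, quant_ratio=1, bytes_per_param=2):
    """Parameter memory per GPU: P / (N * Q), with P in bytes
    (FP16 => 2 bytes/param)."""
    return total_params * bytes_per_param / (num_gpus * quant_ratio)

GB = 1e9  # the tables on this page use decimal gigabytes
opt_175b = param_bytes_per_gpu(175e9, num_gpus=8, quant_ratio=4) / GB  # ~10.94 GB
```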

Example Memory Calculations

| Model | Params | FP16 Size | GPUs (N) | Quant (Q) | Memory per GPU (params only) |
|---|---|---|---|---|---|
| OPT-175B | 175B | 350 GB | 1 | 4 (4-bit) | ~87.5 GB (requires CPU offload) |
| OPT-175B | 175B | 350 GB | 8 | 1 (FP16) | ~43.75 GB |
| OPT-175B | 175B | 350 GB | 8 | 4 (4-bit) | ~10.9 GB |
| LLaMA-2-70B | 70B | 140 GB | 1 | 4 (4-bit) | ~35 GB |
| BLOOM-176B | 176B | 352 GB | 1 | 4 (4-bit) | ~88 GB (requires CPU offload) |

HfDeepSpeedConfig Mechanism

The HfDeepSpeedConfig object hooks into HuggingFace's model-loading path. When instantiated, it registers a global configuration that from_pretrained checks. If the configuration is present, model loading:

  1. Allocates model parameters on the meta device (no storage is allocated).
  2. Distributes parameters across ranks according to the ZeRO Stage 3 partitioning.
  3. Loads weights from disk/network directly into the partitioned buffers.

This avoids the memory spike of loading the full model on every rank before partitioning.

ZeRO Stage 3 Configuration Parameters

| Parameter | Value | Description |
|---|---|---|
| stage | 3 | Full parameter partitioning |
| stage3_prefetch_bucket_size | 2 * H * H | Size of prefetch buffers for overlapping communication with computation |
| stage3_param_persistence_threshold | H | Parameters with fewer elements than this stay resident on all ranks |
| stage3_max_live_parameters | 2 * H * H | Maximum number of parameters materialized simultaneously during the forward pass |

where H is the model's hidden_size.
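For concreteness, with an assumed hidden size of H = 12288 (a GPT-3-scale value, chosen here purely as an illustration):

```python
H = 12288
prefetch_bucket = 2 * H * H   # 301,989,888 elements prefetched ahead of use
persistence_threshold = H     # tensors with < 12,288 elements stay resident
max_live = 2 * H * H          # cap on simultaneously gathered parameters
```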
