
Principle:Microsoft DeepSpeedExamples ZeRO Inference Environment

From Leeroopedia


Domains

  • Infrastructure
  • Distributed_Computing

Overview

A deployment methodology for configuring distributed inference environments to run models exceeding single-GPU memory capacity.

Description

ZeRO-Inference uses DeepSpeed's ZeRO Stage 3 with CPU/NVMe offloading to partition model parameters across GPUs and host memory. The environment must be configured with the correct distributed backend (NCCL), GPU count, offload strategy, and quantization settings. Shell scripts serve as the configuration interface, encapsulating model-specific launch parameters.

The environment configuration establishes several critical properties:

  1. Distributed backend initialization: The NCCL backend is used for GPU-to-GPU communication. DeepSpeed's launcher (deepspeed --num_gpus N) handles process spawning, rank assignment, and environment variable propagation.
  2. Offload strategy selection: Each model and hardware combination requires a specific offload strategy. The three tiers are:
    • GPU-only: All parameters reside in GPU HBM. Only feasible for models smaller than total GPU memory.
    • CPU offload: Parameters are offloaded to host CPU memory and fetched to GPU on demand via PCIe. Enabled by --cpu-offload.
    • NVMe (disk) offload: Parameters are offloaded to NVMe storage. Enables the largest models on the most constrained hardware. Enabled by --disk-offload.
  3. Quantization configuration: 4-bit or 8-bit weight quantization reduces both memory footprint and PCIe transfer volume. Configured via --quant_bits and --quant_group_size.
  4. KV cache offloading: For throughput-oriented scenarios, KV cache can be moved to CPU memory, increasing the maximum batch size. Enabled by --kv-offload.
  5. Pinned memory: CPU memory pinning accelerates data transfer between host and device at the cost of reduced available CPU memory. Controlled by --pin-memory.
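The offload tiers above map onto DeepSpeed's zero_optimization config section. A minimal sketch of how such a config might be assembled (the helper name and specific field values are illustrative, not the repository's exact scripts):

```python
def make_ds_config(offload_device="cpu", offload_dir=None, pin_memory=True):
    """Build a ZeRO Stage 3 inference config dict for a chosen offload tier.

    offload_device: "cpu" for host-RAM offload, "nvme" for disk offload,
    or None for GPU-only placement. Illustrative helper, not from the repo.
    """
    config = {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,  # partition parameters across ranks
        },
        "train_micro_batch_size_per_gpu": 1,
    }
    if offload_device is not None:
        offload_param = {"device": offload_device, "pin_memory": pin_memory}
        if offload_device == "nvme":
            # NVMe offload needs a filesystem path for parameter swap files
            offload_param["nvme_path"] = offload_dir or "~/offload_dir"
        config["zero_optimization"]["offload_param"] = offload_param
    return config

cfg = make_ds_config("nvme", "/mnt/nvme/offload")
```

The three tiers then correspond to `make_ds_config(None)`, `make_ds_config("cpu")`, and `make_ds_config("nvme", path)`, mirroring the --cpu-offload and --disk-offload flags.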

Environment Dependencies

The runtime environment requires:

  • DeepSpeed >= 0.10.3 (for weight quantization and KV cache offloading support)
  • PyTorch with CUDA support
  • HuggingFace Transformers (forked version for KV cache offloading: transformers @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu)
  • accelerate (for init_empty_weights context manager)
  • packaging (for version checking)
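Since weight quantization and KV cache offloading gate on the DeepSpeed version, a minimal check using packaging (the function name is illustrative) might look like:

```python
from packaging import version

MIN_DEEPSPEED = "0.10.3"  # first release with quantization + KV cache offload

def supports_zero_inference_extras(installed: str) -> bool:
    """Return True if this DeepSpeed version supports --quant_bits and --kv-offload."""
    return version.parse(installed) >= version.parse(MIN_DEEPSPEED)

print(supports_zero_inference_extras("0.10.3"))  # True
print(supports_zero_inference_extras("0.9.5"))   # False
```

Note that a plain string comparison would get this wrong ("0.9.5" > "0.10.3" lexicographically), which is why packaging's version parsing is listed as a dependency.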

Theoretical Basis

ZeRO Stage 3 partitions parameters, gradients, and optimizer states across data-parallel ranks. For inference, only parameter partitioning is relevant since gradients and optimizer states are not needed. The fundamental memory equation per GPU is:

Memory_per_GPU = model_size / N + activation_memory + KV_cache

where:

  • model_size is the total model parameter memory (e.g., ~350 GB for a 175B parameter model in FP16)
  • N is the number of data-parallel ranks (GPUs)
  • activation_memory is the transient memory for intermediate computations
  • KV_cache grows linearly with batch size and sequence length

With CPU/NVMe offloading, the parameter pool extends beyond aggregate GPU memory into host RAM and NVMe storage tiers, shrinking the per-GPU parameter footprint further. The trade-off is added latency from PCIe transfers, which throughput-oriented inference amortizes over larger batch sizes.
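The memory equation can be checked with a few lines of arithmetic (a sketch; the function name and the simplification of activation/KV terms to fixed inputs are assumptions):

```python
def mem_per_gpu_gb(params_billion, n_gpus, bytes_per_param=2,
                   activation_gb=0.0, kv_cache_gb=0.0):
    """Per-GPU memory from the equation above: model_size / N + activations + KV cache.

    bytes_per_param defaults to 2 (FP16). 1e9 params * bytes/param ~ GB.
    """
    model_size_gb = params_billion * bytes_per_param
    return model_size_gb / n_gpus + activation_gb + kv_cache_gb

# A 175B-parameter model in FP16 (~350 GB) sharded across 4 GPUs:
print(mem_per_gpu_gb(175, 4))  # 87.5 GB of parameters per GPU
```

At N = 4 the parameter shard alone already exceeds a 48 GB GPU, which is exactly the regime where the CPU/NVMe tiers below become necessary.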

Memory Reduction from Quantization

With B-bit quantization:

Quantized_model_size = model_size * (B / 16)

For 4-bit quantization of a 175B model: 350 GB * (4/16) = 87.5 GB, a 4x reduction. This reduces both the memory footprint and the volume of data transferred over PCIe during parameter fetching.
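The same reduction in code (a sketch against the FP16 baseline; real quantized checkpoints carry a small additional overhead for per-group scale metadata, which this ignores):

```python
def quantized_size_gb(model_size_gb, bits):
    """B-bit weight size relative to the FP16 baseline: size * (B / 16)."""
    return model_size_gb * bits / 16

print(quantized_size_gb(350, 4))  # 87.5 GB, a 4x reduction
print(quantized_size_gb(350, 8))  # 175.0 GB, a 2x reduction
```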

Key Configuration Parameters

| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| GPU count | --num_gpus (deepspeed launcher) | 1 | Number of GPUs for distributed inference |
| CPU offload | --cpu-offload | Disabled | Offload model parameters to CPU memory |
| Disk offload | --disk-offload | Disabled | Offload model parameters to NVMe storage |
| Offload directory | --offload-dir | ~/offload_dir | Path for NVMe offload storage |
| Quantization bits | --quant_bits | 16 | Weight quantization precision (4, 8, or 16) |
| Quantization group size | --quant_group_size | 64 | Number of weights per quantization group |
| KV cache offload | --kv-offload | Disabled | Offload KV cache to CPU memory |
| Pin memory | --pin-memory | 0 | Use pinned CPU memory for faster transfers |
| Batch size | --batch-size | 1 | Number of sequences processed in parallel |
| Prompt length | --prompt-len | 512 | Length of the input prompt in tokens |
| Generation length | --gen-len | 32 | Number of tokens to generate |
| Benchmark loops | --loops | 3 | Number of generation iterations for benchmarking |
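Since shell scripts are the configuration interface, a launch line combines the launcher's --num_gpus with the script-level flags above. A hedged sketch of assembling one (the script name run_model.py and the helper function are hypothetical):

```python
import shlex

def launch_command(script, num_gpus=1, cpu_offload=False, disk_offload=False,
                   offload_dir="~/offload_dir", quant_bits=16, batch_size=1):
    """Assemble a deepspeed launch line from the flags in the table above."""
    cmd = ["deepspeed", f"--num_gpus={num_gpus}", script,
           f"--quant_bits={quant_bits}", f"--batch-size={batch_size}"]
    if cpu_offload:
        cmd.append("--cpu-offload")
    if disk_offload:
        cmd += ["--disk-offload", f"--offload-dir={offload_dir}"]
    return shlex.join(cmd)

print(launch_command("run_model.py", num_gpus=2, cpu_offload=True, quant_bits=4))
```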

Hardware Reference Configuration

The reference benchmarks use:

| Component | Specification |
|---|---|
| GPU | NVIDIA A6000 (48 GB GDDR6) |
| CPU Memory | 252 GB host RAM |
| NVMe | PNY CS3040 2TB (5600 MB/s sequential reads) |
| Interconnect | PCIe Gen4 |
