Principle: Microsoft DeepSpeedExamples ZeRO Inference Environment
Sources
- Doc: DeepSpeed ZeRO Offloading -- deepspeed.ai/tutorials/zero-offloading
- Blog: ZeRO-Inference: Democratizing massive model inference -- deepspeed.ai/2022/09/09/zero-inference
Domains
- Infrastructure
- Distributed_Computing
Overview
A deployment methodology for configuring distributed inference environments to run models exceeding single-GPU memory capacity.
Description
ZeRO-Inference uses DeepSpeed's ZeRO Stage 3 with CPU/NVMe offloading to partition model parameters across GPUs and host memory. The environment must be configured with the correct distributed backend (NCCL), GPU count, offload strategy, and quantization settings. Shell scripts serve as the configuration interface, encapsulating model-specific launch parameters.
The environment configuration establishes several critical properties:
- Distributed backend initialization: The NCCL backend is used for GPU-to-GPU communication. DeepSpeed's launcher (`deepspeed --num_gpus N`) handles process spawning, rank assignment, and environment variable propagation.
- Offload strategy selection: Each model and hardware combination requires a specific offload strategy. The three tiers are:
  - GPU-only: All parameters reside in GPU HBM. Only feasible for models smaller than total GPU memory.
  - CPU offload: Parameters are offloaded to host CPU memory and fetched to the GPU on demand over PCIe. Enabled by `--cpu-offload`.
  - NVMe (disk) offload: Parameters are offloaded to NVMe storage, enabling the largest models on the most constrained hardware. Enabled by `--disk-offload`.
- Quantization configuration: 4-bit or 8-bit weight quantization reduces both memory footprint and PCIe transfer volume. Configured via `--quant_bits` and `--quant_group_size`.
- KV cache offloading: For throughput-oriented scenarios, the KV cache can be moved to CPU memory, increasing the maximum batch size. Enabled by `--kv-offload`.
- Pinned memory: CPU memory pinning accelerates host-to-device transfers at the cost of reduced available CPU memory. Controlled by `--pin-memory`.
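The offload tiers map onto DeepSpeed's ZeRO Stage 3 configuration schema. Below is a minimal sketch of how a launch script might assemble the corresponding config dict; the key names (`zero_optimization`, `offload_param`, `nvme_path`) come from DeepSpeed's config schema, while the helper function and its defaults are illustrative, not the actual DeepSpeedExamples code.

```python
# Sketch of a ZeRO Stage 3 inference config for each offload tier.
# Key names follow DeepSpeed's JSON config schema; values are illustrative.

def build_ds_config(offload_device="cpu", pin_memory=True, offload_dir=None):
    """Build a ZeRO-Inference config dict.

    offload_device: "none" (GPU-only), "cpu" (CPU offload), or "nvme" (disk offload).
    """
    offload_param = {"device": offload_device, "pin_memory": pin_memory}
    if offload_device == "nvme":
        # NVMe offload additionally needs a directory on fast local storage.
        offload_param["nvme_path"] = offload_dir or "~/offload_dir"
    return {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,                      # partition parameters across ranks
            "offload_param": offload_param,  # where partitioned params live
        },
        "train_micro_batch_size_per_gpu": 1,  # required field, unused at inference
    }

config = build_ds_config("cpu")
```

A script would pass this dict to `deepspeed.initialize` (or write it to a JSON file) after selecting the tier from its CLI flags.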
Environment Dependencies
The runtime environment requires:
- DeepSpeed >= 0.10.3 (for weight quantization and KV cache offloading support)
- PyTorch with CUDA support
- HuggingFace Transformers (forked version for KV cache offloading: `transformers @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu`)
- accelerate (for the `init_empty_weights` context manager)
- packaging (for version checking)
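The version floor above implies a runtime gate before enabling quantization or KV cache offload. A sketch of that check using `packaging`, as the dependency list suggests (the function name is hypothetical; `installed` would come from `deepspeed.__version__` at runtime):

```python
# Gate quantization / KV cache offload on the DeepSpeed version.
# Requires the `packaging` package listed in the dependencies above.
from packaging import version

MIN_DEEPSPEED = version.parse("0.10.3")

def supports_quant_and_kv_offload(installed: str) -> bool:
    """True if the installed DeepSpeed is new enough for weight
    quantization and KV cache offloading (>= 0.10.3)."""
    return version.parse(installed) >= MIN_DEEPSPEED
```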
Theoretical Basis
ZeRO Stage 3 partitions parameters, gradients, and optimizer states across data-parallel ranks. For inference, only parameter partitioning is relevant since gradients and optimizer states are not needed. The fundamental memory equation per GPU is:
Memory_per_GPU = model_size / N + activation_memory + KV_cache
where:
- `model_size` is the total model parameter memory (e.g., ~350 GB for a 175B-parameter model in FP16)
- `N` is the number of data-parallel ranks (GPUs)
- `activation_memory` is the transient memory for intermediate computations
- `KV_cache` grows linearly with batch size and sequence length
With CPU/NVMe offloading, the effective N extends beyond GPU count to include host memory and storage tiers. The trade-off is increased latency from PCIe transfers, which is amortized across larger batch sizes in throughput-oriented inference.
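The per-GPU memory equation can be made concrete with a small calculator. The 350 GB figure matches the FP16 175B example above; the activation and KV cache budgets in the usage line are assumed for illustration, not taken from the source.

```python
def memory_per_gpu(model_size_gb, num_gpus, activation_gb, kv_cache_gb):
    """Per-GPU memory per the equation above:
    model_size / N + activation_memory + KV_cache (all in GB)."""
    return model_size_gb / num_gpus + activation_gb + kv_cache_gb

# 175B model in FP16 (~350 GB) sharded over 8 GPUs, with illustrative
# activation (4 GB) and KV cache (2 GB) budgets:
print(memory_per_gpu(350, 8, 4, 2))  # 43.75 + 4 + 2 = 49.75 GB per GPU
```

At 49.75 GB this configuration would just exceed a single 48 GB A6000, which is exactly the regime where CPU or NVMe offload becomes necessary.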
Memory Reduction from Quantization
With B-bit quantization:
Quantized_model_size = model_size * (B / 16)
For 4-bit quantization of a 175B model: 350 GB * (4/16) = 87.5 GB, a 4x reduction. This reduces both the memory footprint and the volume of data transferred over PCIe during parameter fetching.
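The same arithmetic as a one-line function, reproducing the 4-bit example above:

```python
def quantized_size_gb(model_size_gb, bits):
    """Quantized_model_size = model_size * (B / 16), per the formula above."""
    return model_size_gb * bits / 16

print(quantized_size_gb(350, 4))  # 87.5 GB: a 4x reduction
print(quantized_size_gb(350, 8))  # 175.0 GB: a 2x reduction
```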
Key Configuration Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| GPU count | `--num_gpus` (deepspeed launcher) | 1 | Number of GPUs for distributed inference |
| CPU offload | `--cpu-offload` | Disabled | Offload model parameters to CPU memory |
| Disk offload | `--disk-offload` | Disabled | Offload model parameters to NVMe storage |
| Offload directory | `--offload-dir` | `~/offload_dir` | Path for NVMe offload storage |
| Quantization bits | `--quant_bits` | 16 | Weight quantization precision (4, 8, or 16) |
| Quantization group size | `--quant_group_size` | 64 | Number of weights per quantization group |
| KV cache offload | `--kv-offload` | Disabled | Offload KV cache to CPU memory |
| Pin memory | `--pin-memory` | 0 | Use pinned CPU memory for faster transfers |
| Batch size | `--batch-size` | 1 | Number of sequences processed in parallel |
| Prompt length | `--prompt-len` | 512 | Length of the input prompt in tokens |
| Generation length | `--gen-len` | 32 | Number of tokens to generate |
| Benchmark loops | `--loops` | 3 | Number of generation iterations for benchmarking |
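The script-level flags in the table could be wired up with `argparse` along these lines. This is a sketch of the interface only, not the actual DeepSpeedExamples parser; `--num_gpus` is omitted because it belongs to the deepspeed launcher, not the script.

```python
import argparse

def build_parser():
    # Flag names and defaults mirror the parameter table above.
    p = argparse.ArgumentParser(description="ZeRO-Inference launch options (sketch)")
    p.add_argument("--cpu-offload", action="store_true", help="offload params to CPU memory")
    p.add_argument("--disk-offload", action="store_true", help="offload params to NVMe storage")
    p.add_argument("--offload-dir", default="~/offload_dir", help="path for NVMe offload")
    p.add_argument("--quant_bits", type=int, default=16, choices=[4, 8, 16])
    p.add_argument("--quant_group_size", type=int, default=64)
    p.add_argument("--kv-offload", action="store_true", help="offload KV cache to CPU")
    p.add_argument("--pin-memory", type=int, default=0, help="use pinned CPU memory")
    p.add_argument("--batch-size", type=int, default=1)
    p.add_argument("--prompt-len", type=int, default=512)
    p.add_argument("--gen-len", type=int, default=32)
    p.add_argument("--loops", type=int, default=3)
    return p

# Example: 4-bit quantization with CPU offload.
args = build_parser().parse_args(["--cpu-offload", "--quant_bits", "4"])
```

Note that argparse converts the dashed flag names to underscored attributes (`args.cpu_offload`, `args.batch_size`).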
Hardware Reference Configuration
The reference benchmarks use:
| Component | Specification |
|---|---|
| GPU | NVIDIA A6000 (48 GB HBM) |
| CPU Memory | 252 GB host RAM |
| NVMe | PNY CS3040 2TB (5600 MB/s sequential reads) |
| Interconnect | PCIe Gen4 |