Principle: Microsoft DeepSpeedExamples ZeRO Inference Environment
Sources
- Doc: DeepSpeed ZeRO Offloading -- deepspeed.ai/tutorials/zero-offloading
- Blog: ZeRO-Inference: Democratizing massive model inference -- deepspeed.ai/2022/09/09/zero-inference
Domains
- Infrastructure
- Distributed_Computing
Overview
A deployment methodology for configuring distributed inference environments to run models exceeding single-GPU memory capacity.
Description
ZeRO-Inference uses DeepSpeed's ZeRO Stage 3 with CPU/NVMe offloading to partition model parameters across GPUs and host memory. The environment must be configured with the correct distributed backend (NCCL), GPU count, offload strategy, and quantization settings. Shell scripts serve as the configuration interface, encapsulating model-specific launch parameters.
The environment configuration establishes several critical properties:
- Distributed backend initialization: The NCCL backend is used for GPU-to-GPU communication. DeepSpeed's launcher (`deepspeed --num_gpus N`) handles process spawning, rank assignment, and environment variable propagation.
- Offload strategy selection: Each model and hardware combination requires a specific offload strategy. The three tiers are:
  - GPU-only: All parameters reside in GPU HBM. Only feasible for models smaller than total GPU memory.
  - CPU offload: Parameters are offloaded to host CPU memory and fetched to the GPU on demand over PCIe. Enabled by `--cpu-offload`.
  - NVMe (disk) offload: Parameters are offloaded to NVMe storage, enabling the largest models on the most constrained hardware. Enabled by `--disk-offload`.
- Quantization configuration: 4-bit or 8-bit weight quantization reduces both memory footprint and PCIe transfer volume. Configured via `--quant_bits` and `--quant_group_size`.
- KV cache offloading: For throughput-oriented scenarios, the KV cache can be moved to CPU memory, increasing the maximum batch size. Enabled by `--kv-offload`.
- Pinned memory: CPU memory pinning accelerates host-to-device transfers at the cost of reduced available CPU memory. Controlled by `--pin-memory`.
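The offload tiers map onto DeepSpeed's ZeRO Stage 3 configuration schema. Below is a minimal sketch of how a launch script might assemble the corresponding config dict; the key names (`zero_optimization`, `offload_param`, `nvme_path`) come from DeepSpeed's config schema, while the helper function and its defaults are illustrative, not the actual DeepSpeedExamples code.

```python
# Sketch of a ZeRO Stage 3 inference config for each offload tier.
# Key names follow DeepSpeed's JSON config schema; values are illustrative.

def build_ds_config(offload_device="cpu", pin_memory=True, offload_dir=None):
    """Build a ZeRO-Inference config dict.

    offload_device: "none" (GPU-only), "cpu" (CPU offload), or "nvme" (disk offload).
    """
    offload_param = {"device": offload_device, "pin_memory": pin_memory}
    if offload_device == "nvme":
        # NVMe offload additionally needs a directory on fast local storage.
        offload_param["nvme_path"] = offload_dir or "~/offload_dir"
    return {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,                      # partition parameters across ranks
            "offload_param": offload_param,  # where partitioned params live
        },
        "train_micro_batch_size_per_gpu": 1,  # required field, unused at inference
    }

config = build_ds_config("cpu")
```

A script would pass this dict to `deepspeed.initialize` (or write it to a JSON file) after selecting the tier from its CLI flags.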
Environment Dependencies
The runtime environment requires:
- DeepSpeed >= 0.10.3 (for weight quantization and KV cache offloading support)
- PyTorch with CUDA support
- HuggingFace Transformers (forked version for KV cache offloading: `transformers @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu`)
- accelerate (for the `init_empty_weights` context manager)
- packaging (for version checking)
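The version floor above implies a runtime gate before enabling quantization or KV cache offload. A sketch of that check using `packaging`, as the dependency list suggests (the function name is hypothetical; `installed` would come from `deepspeed.__version__` at runtime):

```python
# Gate quantization / KV cache offload on the DeepSpeed version.
# Requires the `packaging` package listed in the dependencies above.
from packaging import version

MIN_DEEPSPEED = version.parse("0.10.3")

def supports_quant_and_kv_offload(installed: str) -> bool:
    """True if the installed DeepSpeed is new enough for weight
    quantization and KV cache offloading (>= 0.10.3)."""
    return version.parse(installed) >= MIN_DEEPSPEED
```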
Theoretical Basis
ZeRO Stage 3 partitions parameters, gradients, and optimizer states across data-parallel ranks. For inference, only parameter partitioning is relevant since gradients and optimizer states are not needed. The fundamental memory equation per GPU is:
Memory_per_GPU = model_size / N + activation_memory + KV_cache
where:
- `model_size` is the total model parameter memory (e.g., ~350 GB for a 175B-parameter model in FP16)
- `N` is the number of data-parallel ranks (GPUs)
- `activation_memory` is the transient memory for intermediate computations
- `KV_cache` grows linearly with batch size and sequence length
With CPU/NVMe offloading, the effective N extends beyond GPU count to include host memory and storage tiers. The trade-off is increased latency from PCIe transfers, which is amortized across larger batch sizes in throughput-oriented inference.
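The per-GPU memory equation can be made concrete with a small calculator. The 350 GB figure matches the FP16 175B example above; the activation and KV cache budgets in the usage line are assumed for illustration, not taken from the source.

```python
def memory_per_gpu(model_size_gb, num_gpus, activation_gb, kv_cache_gb):
    """Per-GPU memory per the equation above:
    model_size / N + activation_memory + KV_cache (all in GB)."""
    return model_size_gb / num_gpus + activation_gb + kv_cache_gb

# 175B model in FP16 (~350 GB) sharded over 8 GPUs, with illustrative
# activation (4 GB) and KV cache (2 GB) budgets:
print(memory_per_gpu(350, 8, 4, 2))  # 43.75 + 4 + 2 = 49.75 GB per GPU
```

At 49.75 GB this configuration would just exceed a single 48 GB A6000, which is exactly the regime where CPU or NVMe offload becomes necessary.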
Memory Reduction from Quantization
With B-bit quantization:
Quantized_model_size = model_size * (B / 16)
For 4-bit quantization of a 175B model: 350 GB * (4/16) = 87.5 GB, a 4x reduction. This reduces both the memory footprint and the volume of data transferred over PCIe during parameter fetching.
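The same arithmetic as a one-line function, reproducing the 4-bit example above:

```python
def quantized_size_gb(model_size_gb, bits):
    """Quantized_model_size = model_size * (B / 16), per the formula above."""
    return model_size_gb * bits / 16

print(quantized_size_gb(350, 4))  # 87.5 GB: a 4x reduction
print(quantized_size_gb(350, 8))  # 175.0 GB: a 2x reduction
```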
Key Configuration Parameters
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| GPU count | `--num_gpus` (deepspeed launcher) | 1 | Number of GPUs for distributed inference |
| CPU offload | `--cpu-offload` | Disabled | Offload model parameters to CPU memory |
| Disk offload | `--disk-offload` | Disabled | Offload model parameters to NVMe storage |
| Offload directory | `--offload-dir` | `~/offload_dir` | Path for NVMe offload storage |
| Quantization bits | `--quant_bits` | 16 | Weight quantization precision (4, 8, or 16) |
| Quantization group size | `--quant_group_size` | 64 | Number of weights per quantization group |
| KV cache offload | `--kv-offload` | Disabled | Offload KV cache to CPU memory |
| Pin memory | `--pin-memory` | 0 | Use pinned CPU memory for faster transfers |
| Batch size | `--batch-size` | 1 | Number of sequences processed in parallel |
| Prompt length | `--prompt-len` | 512 | Length of the input prompt in tokens |
| Generation length | `--gen-len` | 32 | Number of tokens to generate |
| Benchmark loops | `--loops` | 3 | Number of generation iterations for benchmarking |
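The script-level flags in the table could be wired up with `argparse` along these lines. This is a sketch of the interface only, not the actual DeepSpeedExamples parser; `--num_gpus` is omitted because it belongs to the deepspeed launcher, not the script.

```python
import argparse

def build_parser():
    # Flag names and defaults mirror the parameter table above.
    p = argparse.ArgumentParser(description="ZeRO-Inference launch options (sketch)")
    p.add_argument("--cpu-offload", action="store_true", help="offload params to CPU memory")
    p.add_argument("--disk-offload", action="store_true", help="offload params to NVMe storage")
    p.add_argument("--offload-dir", default="~/offload_dir", help="path for NVMe offload")
    p.add_argument("--quant_bits", type=int, default=16, choices=[4, 8, 16])
    p.add_argument("--quant_group_size", type=int, default=64)
    p.add_argument("--kv-offload", action="store_true", help="offload KV cache to CPU")
    p.add_argument("--pin-memory", type=int, default=0, help="use pinned CPU memory")
    p.add_argument("--batch-size", type=int, default=1)
    p.add_argument("--prompt-len", type=int, default=512)
    p.add_argument("--gen-len", type=int, default=32)
    p.add_argument("--loops", type=int, default=3)
    return p

# Example: 4-bit quantization with CPU offload.
args = build_parser().parse_args(["--cpu-offload", "--quant_bits", "4"])
```

Note that argparse converts the dashed flag names to underscored attributes (`args.cpu_offload`, `args.batch_size`).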
Hardware Reference Configuration
The reference benchmarks use:
| Component | Specification |
|---|---|
| GPU | NVIDIA A6000 (48 GB HBM) |
| CPU Memory | 252 GB host RAM |
| NVMe | PNY CS3040 2TB (5600 MB/s sequential reads) |
| Interconnect | PCIe Gen4 |