Heuristic: Microsoft DeepSpeedExamples ZeRO-Inference Throughput Tuning
| Knowledge Sources | |
|---|---|
| Domains | Inference, Optimization, LLMs |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Performance-tuning strategy for ZeRO-Inference: peak throughput occurs before the maximum batch size, pinned memory trades batch capacity for transfer speed, and quantization configuration should start simple and then be refined.
Description
ZeRO-Inference offloads model parameters to CPU RAM or NVMe storage during inference. The relationship between batch size and throughput is non-linear: increasing batch size initially improves throughput through better GPU utilization, but beyond a critical point, CPU memory pressure from managing offloaded parameters causes throughput to degrade. Similarly, pinned memory accelerates PCIe transfers but reduces the available CPU memory pool, limiting the maximum achievable batch size. Finding the optimal configuration requires empirical profiling on your specific hardware.
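As a concrete anchor for the description above, here is a minimal sketch of a ZeRO stage-3 inference config that offloads parameters to CPU. The field names follow the DeepSpeed ZeRO config schema; the specific values (hidden size, pinning, thresholds) are illustrative assumptions, not tuned defaults.

```python
# Sketch of a ZeRO-3 config with CPU parameter offload for inference.
# hidden_size=12288 is an assumption (OPT-175B's hidden width).
hidden_size = 12288

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",     # or "nvme" with an nvme_path for larger capacity
            "pin_memory": True,  # faster PCIe transfers, but consumes host RAM
        },
        # Prefetch bucket sized to the model width, as in the example script.
        "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
        "stage3_param_persistence_threshold": hidden_size,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```

Toggling `pin_memory` on and off is the cheapest way to observe the transfer-speed vs. batch-capacity trade-off discussed below.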
Usage
Apply this heuristic when setting up ZeRO-Inference for production or benchmarking. Start with conservative settings and incrementally tune batch size and memory pinning. This is especially relevant when running 175B+ models on limited GPU hardware (e.g., a single A6000).
The Insight (Rule of Thumb)
- Action 1: Do not assume maximum batch size gives maximum throughput. Profile throughput at multiple batch sizes (e.g., 1, 2, 4, 8, 16, 32, 48, 64, 96).
- Action 2: Test both pinned and unpinned memory configurations. Pinned memory speeds up PCIe transfers (CPU <-> GPU) but limits maximum batch size.
- Action 3: Start with `quantized_initialization` for weight quantization (easiest), then switch to `post_init_quantization` for per-layer control if needed.
- Value: Default quantization group_size=64 with asymmetric quantization; KV-cache buffer_count=3-10 and buffer_size=1-9 GB depending on the model.
- Trade-off: Pinned memory gives faster transfers but a smaller maximum batch; quantization lowers memory use at the cost of potential quality degradation; NVMe offloading is slower but effectively unlimited in capacity.
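Action 1 can be sketched as a simple sweep harness. `measure_throughput` here is a hypothetical stand-in that mimics the inverted-U curve with a synthetic formula; in practice, replace its body with a timed `model.generate()` run and report real tokens/sec.

```python
import math

def measure_throughput(batch_size: int) -> float:
    """Hypothetical stand-in for a timed generation run (tokens/sec).
    The synthetic curve only mimics the inverted-U shape: throughput
    grows with batch size, then collapses under memory pressure."""
    return batch_size / (1.0 + math.exp(0.15 * (batch_size - 40)))

def sweep(batch_sizes):
    """Profile each candidate batch size; return (best_bs, full curve)."""
    results = {bs: measure_throughput(bs) for bs in batch_sizes}
    best_bs = max(results, key=results.get)
    return best_bs, results

best, curve = sweep([1, 2, 4, 8, 16, 32, 48, 64, 96])
```

With this synthetic curve the peak lands at an interior batch size (32), not at the maximum tested, which is exactly the behavior the rule of thumb warns about. Run the same sweep twice, once with pinned memory and once without, to cover Action 2.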
Reasoning
The throughput curve for offloaded inference follows an inverted-U shape. At small batch sizes, the GPU is underutilized during computation while waiting for parameter transfers. At large batch sizes, the CPU becomes a bottleneck managing memory allocations, page table entries, and PCIe transfer scheduling. The optimal point depends on:
- GPU compute capacity
- PCIe bandwidth (Gen4 vs Gen5)
- CPU memory bandwidth and capacity
- Model architecture (hidden size determines prefetch bucket size)
The prefetch bucket size is dynamically computed as `2 * hidden_size * hidden_size`, which means larger models require more aggressive prefetching. For OPT-175B on a single A6000 with INT4 quantization and KV offloading, the observed throughput was 2.26 tokens/sec.
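To make the formula concrete, the computation from the paragraph above can be checked directly (12288 is OPT-175B's hidden size):

```python
# Prefetch bucket size as computed in the example script:
# 2 * hidden_size^2 elements, so it grows quadratically with model width.
def prefetch_bucket_size(hidden_size: int) -> int:
    return 2 * hidden_size * hidden_size

opt_175b_bucket = prefetch_bucket_size(12288)  # 301,989,888 elements
```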
Code Evidence:
Prefetch configuration from `inference/huggingface/zero_inference/run_model.py`:
```python
"stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size
```
Version guard from `run_model.py:28`:
```python
assert version.parse(deepspeed.__version__) >= version.parse("0.10.3"), \
    "ZeRO-Inference with weight quantization and kv cache offloading is available only in DeepSpeed 0.10.3+"
```
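For the quantization side of the version guard, a sketch of the weight-quantization config block follows. The key names reflect my reading of the DeepSpeed 0.10.3+ schema (`weight_quantization` → `quantized_initialization`) and should be verified against your installed version; the values mirror the defaults stated in the Insight section.

```python
# Sketch of a weight-quantization config (Action 3: start with
# quantized_initialization; switch to post_init_quantization later
# if per-layer control is needed). Schema assumed, verify locally.
weight_quant_config = {
    "weight_quantization": {
        "quantized_initialization": {
            "num_bits": 4,       # INT4 weights
            "group_size": 64,    # default group size from above
            "group_dim": 1,
            "symmetric": False,  # asymmetric quantization (default)
        }
    }
}
```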