
Environment:Microsoft DeepSpeedExamples ZeRO Inference Runtime

From Leeroopedia


Knowledge Sources
Domains: Inference, Infrastructure, Deep_Learning
Last Updated: 2026-02-07 13:00 GMT

Overview

A Linux environment with DeepSpeed >= 0.10.3, a custom Transformers fork for KV cache offloading, and an NVIDIA GPU, used to run ZeRO-Inference on 175B+ parameter models with weight quantization and CPU/NVMe offloading.

Description

This environment enables inference of extremely large language models (OPT-175B, BLOOM-176B, LLaMA-2-70B) on limited GPU hardware by leveraging ZeRO Stage 3 parameter offloading to CPU RAM or NVMe storage. It supports INT4 weight quantization to reduce memory footprint by 4x and KV cache offloading to CPU or NVMe for long-sequence generation. The environment requires a custom fork of HuggingFace Transformers with KV cache offloading patches.
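The ZeRO Stage 3 offloading described above is driven by a DeepSpeed config. The following is a minimal sketch: the key names follow the DeepSpeed ZeRO-3 config schema, but the concrete values (batch size, pinned memory, the commented-out NVMe path) are placeholder assumptions, not the repository's exact settings.

```python
# Minimal ZeRO-3 inference config sketch with parameter offload.
# Values are illustrative assumptions; see the DeepSpeed config docs
# for the full schema.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",      # set to "nvme" to spill weights to SSD instead
            "pin_memory": True,   # faster host<->device transfers, pins host RAM
            # "nvme_path": "/local_nvme",  # required when device == "nvme"
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # acts as the inference batch size
}
```

Switching `device` between `"cpu"` and `"nvme"` is what trades host RAM for SSD bandwidth in the scenarios listed under System Requirements below.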

Usage

Use this environment for any large model inference workflow that requires running models exceeding available GPU VRAM. It is the mandatory prerequisite for the Launch_Scripts_ZeRO_Inference, Get_Model_Config, Get_DS_Model, Run_Generation, and Write_Benchmark_Log implementations.

System Requirements

Category | Requirement | Notes
OS | Linux | Ubuntu 20.04+ recommended
GPU | NVIDIA GPU | Tested on A6000 (48GB VRAM); minimum 16GB recommended
CPU RAM | 252GB+ | Required for OPT-175B/BLOOM-176B; scales with model size
NVMe | SSD CS3040 or similar | 5600+ MB/s sequential reads; required only for NVMe offloading
Disk | 500GB+ SSD | For model weight storage (175B models are ~350GB in fp16)
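The disk figure in the table follows directly from parameter count times bytes per parameter. A small helper makes the arithmetic explicit (the function name is mine, for illustration):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

# 175B parameters in fp16 (16 bits) -> 350 GB, matching the table above.
fp16_gb = weight_memory_gb(175e9, 16)

# The same model after INT4 quantization -> 87.5 GB, the 4x reduction
# mentioned in the Description.
int4_gb = weight_memory_gb(175e9, 4)
```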

Dependencies

System Packages

  • CUDA Toolkit (11.x or 12.x)
  • `libaio-dev` (for NVMe offloading with DeepSpeed AIO)

Python Packages

  • `deepspeed>=0.10.3` (requirements.txt lists `>=0.10.1`, but the runtime asserts 0.10.3+; see Code Evidence)
  • `torch`
  • `transformers` (custom fork: `git+https://github.com/tjruwase/transformers@kvcache-offload-cpu`)
  • `packaging`
  • `accelerate`

Credentials

No specific API credentials required for public models. For gated models (e.g., Llama-2), set:

  • `HF_TOKEN`: HuggingFace API token with read access to gated model repositories.

Quick Install

# Install core packages
pip install "deepspeed>=0.10.3" torch packaging accelerate

# Install custom transformers fork with KV cache offloading
pip install git+https://github.com/tjruwase/transformers@kvcache-offload-cpu

Code Evidence

Hard version assertion from `inference/huggingface/zero_inference/run_model.py:28`:

assert version.parse(deepspeed.__version__) >= version.parse("0.10.3"), \
    "ZeRO-Inference with weight quantization and kv cache offloading is available only in DeepSpeed 0.10.3+, please upgrade DeepSpeed"

Requirements from `inference/huggingface/zero_inference/requirements.txt`:

deepspeed>=0.10.1
torch
transformers @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu
packaging
accelerate

ZeRO-3 prefetch configuration from `run_model.py` (dynamically computed):

"stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size

Common Errors

Error Message | Cause | Solution
`AssertionError: ZeRO-Inference with weight quantization...` | DeepSpeed version < 0.10.3 | `pip install "deepspeed>=0.10.3"`
`CUDA out of memory` | Batch size too large for VRAM | Reduce batch size; enable weight quantization with `--dtype int4`
`CPU out of memory` | Host RAM insufficient for model | Reduce batch size or enable NVMe offloading instead of CPU
`ImportError: kv_cache_offload` | Wrong transformers fork installed | Install the custom fork: `pip install git+https://github.com/tjruwase/transformers@kvcache-offload-cpu`

Compatibility Notes

  • Supported Models: OPT (1.3B-175B), BLOOM (7.1B-176B), LLaMA-2 (7B-70B), Mixtral
  • Weight Quantization: INT4 with configurable group_size (default 64); asymmetric quantization
  • KV Cache Offloading: CPU or NVMe; requires custom transformers fork
  • NVMe Offloading: Requires DeepSpeed compiled with AIO support (`libaio-dev`)
  • Pinned Memory: Speeds up offload transfers but limits maximum batch size
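The asymmetric group-wise INT4 scheme noted above can be illustrated with a short NumPy round trip. This is a conceptual sketch only, not DeepSpeed's actual quantization kernel; function names and the error-bound comment are mine.

```python
import numpy as np

def quantize_int4_asym(w: np.ndarray, group_size: int = 64):
    """Asymmetric per-group INT4 quantization (illustrative sketch)."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                  # 4-bit codes span 0..15
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_int4(q, scale, lo, shape):
    # Reconstruction error per value is at most half a quantization step.
    return (q.astype(np.float32) * scale + lo).reshape(shape)
```

With `group_size=64` (the default noted above), each group stores its own `scale`/`lo` pair, so outliers in one group do not widen the quantization step for the rest of the tensor.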
