Environment:Microsoft DeepSpeedExamples ZeRO Inference Runtime
| Knowledge Sources | |
|---|---|
| Domains | Inference, Infrastructure, Deep_Learning |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
Linux environment with DeepSpeed >= 0.10.3, a custom HuggingFace Transformers fork for KV cache offloading, and an NVIDIA GPU, for running ZeRO-Inference on 175B+ parameter models with weight quantization and CPU/NVMe offloading.
Description
This environment enables inference of extremely large language models (OPT-175B, BLOOM-176B, LLaMA-2-70B) on limited GPU hardware by leveraging ZeRO Stage 3 parameter offloading to CPU RAM or NVMe storage. It supports INT4 weight quantization to reduce memory footprint by 4x and KV cache offloading to CPU or NVMe for long-sequence generation. The environment requires a custom fork of HuggingFace Transformers with KV cache offloading patches.
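The offloading behavior described above is driven entirely by a ZeRO Stage 3 configuration dictionary. The sketch below is a minimal, hedged example of such a config for CPU parameter offloading; the key names follow the public DeepSpeed ZeRO-3 config schema, while the batch size and dtype values are illustrative assumptions, not values taken from this repository.

```python
# Minimal sketch of a ZeRO Stage 3 inference config with CPU parameter
# offloading. Key names follow the public DeepSpeed config schema; the
# concrete values (fp16, batch size) are illustrative assumptions.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                       # partition parameters (required for offload)
        "offload_param": {
            "device": "cpu",              # hold full weights in host RAM
            "pin_memory": True,           # faster host<->GPU transfers
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # DeepSpeed expects this even for inference
}

# A typical (hypothetical) initialization would then look like:
#   import deepspeed
#   ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
#   ds_engine.module.eval()
```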
Usage
Use this environment for any large model inference workflow that requires running models exceeding available GPU VRAM. It is the mandatory prerequisite for the Launch_Scripts_ZeRO_Inference, Get_Model_Config, Get_DS_Model, Run_Generation, and Write_Benchmark_Log implementations.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Ubuntu 20.04+ recommended |
| GPU | NVIDIA GPU | Tested on A6000 (48GB VRAM); minimum 16GB recommended |
| CPU RAM | 252GB+ | Required for OPT-175B/BLOOM-176B; scales with model size |
| NVMe SSD | CS3040 or similar | 5600+ MB/s sequential reads; required only for NVMe offloading |
| Disk | 500GB+ SSD | For model weight storage (175B models are ~350GB in fp16) |
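The RAM and disk figures in the table follow directly from parameter-count arithmetic; the short check below reproduces them (illustrative only):

```python
# Back-of-the-envelope memory arithmetic behind the table above.
params = 175e9               # OPT-175B parameter count

fp16_gb = params * 2 / 1e9   # 2 bytes per parameter in fp16
int4_gb = params * 0.5 / 1e9 # 0.5 bytes per parameter in INT4

print(f"fp16 weights: ~{fp16_gb:.0f} GB")  # matches the ~350GB disk figure
print(f"int4 weights: ~{int4_gb:.1f} GB")  # why INT4 cuts the footprint by 4x
```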
Dependencies
System Packages
- CUDA Toolkit (11.x or 12.x)
- `libaio-dev` (for NVMe offloading with DeepSpeed AIO)
Python Packages
- `deepspeed` >= 0.10.3 (hard version assertion in code)
- `torch` (latest compatible with DeepSpeed)
- `transformers` @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu
- `packaging`
- `accelerate`
Credentials
No API credentials are required for public models. For gated models (e.g., Llama-2), set:
- `HF_TOKEN`: HuggingFace API token with read access to gated model repositories.
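A simple pre-flight check (illustrative, not from this repository) can surface a missing token before a long run fails mid-download:

```shell
# Hypothetical pre-flight check before launching a gated-model run.
if [ -z "${HF_TOKEN:-}" ]; then
  echo "warning: HF_TOKEN is not set; gated repos (e.g. Llama-2) will fail to download" >&2
else
  echo "HF_TOKEN detected"
fi
```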
Quick Install
# Install core packages
pip install "deepspeed>=0.10.3" torch packaging accelerate
# Install custom transformers fork with KV cache offloading
pip install git+https://github.com/tjruwase/transformers@kvcache-offload-cpu
Code Evidence
Hard version assertion from `inference/huggingface/zero_inference/run_model.py:28`:
assert version.parse(deepspeed.__version__) >= version.parse("0.10.3"), \
"ZeRO-Inference with weight quantization and kv cache offloading is available only in DeepSpeed 0.10.3+, please upgrade DeepSpeed"
Requirements from `inference/huggingface/zero_inference/requirements.txt` (note the looser `deepspeed>=0.10.1` floor here; the runtime assertion above enforces 0.10.3):
deepspeed>=0.10.1
torch
transformers @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu
packaging
accelerate
ZeRO-3 prefetch configuration from `run_model.py` (dynamically computed):
"stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `AssertionError: ZeRO-Inference with weight quantization...` | DeepSpeed version < 0.10.3 | `pip install "deepspeed>=0.10.3"` |
| `CUDA out of memory` | Batch size too large for VRAM | Reduce batch size; enable weight quantization with `--dtype int4` |
| `CPU out of memory` | Host RAM insufficient for model | Reduce batch size or enable NVMe offloading instead of CPU |
| `ImportError: kv_cache_offload` | Wrong transformers fork installed | Install the custom fork: `pip install git+https://github.com/tjruwase/transformers@kvcache-offload-cpu` |
Compatibility Notes
- Supported Models: OPT (1.3B-175B), BLOOM (7.1B-176B), LLaMA-2 (7B-70B), Mixtral
- Weight Quantization: INT4 with configurable group_size (default 64); asymmetric quantization
- KV Cache Offloading: CPU or NVMe; requires custom transformers fork
- NVMe Offloading: Requires DeepSpeed compiled with AIO support (`libaio-dev`)
- Pinned Memory: Speeds up offload transfers but limits maximum batch size
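For NVMe offloading specifically, the parameter-offload target changes to an `nvme` device plus an `aio` tuning block. The sketch below uses the public DeepSpeed config keys; the path and buffer values are illustrative assumptions, not values from this repository.

```python
# Sketch: ZeRO-3 NVMe parameter offload plus AIO tuning. Key names follow
# the public DeepSpeed config schema; nvme_path and buffer sizes are
# illustrative assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",   # assumed mount point for the SSD
            "pin_memory": True,           # note: pinning limits max batch size
            "buffer_count": 5,
            "buffer_size": 1_000_000_000,
        },
    },
    # AIO settings require DeepSpeed built with libaio support
    "aio": {
        "block_size": 1_048_576,
        "queue_depth": 8,
        "thread_count": 1,
        "single_submit": False,
        "overlap_events": True,
    },
}
```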
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Launch_Scripts_ZeRO_Inference
- Implementation:Microsoft_DeepSpeedExamples_Get_Model_Config
- Implementation:Microsoft_DeepSpeedExamples_Get_DS_Model
- Implementation:Microsoft_DeepSpeedExamples_Run_Generation
- Implementation:Microsoft_DeepSpeedExamples_Write_Benchmark_Log