
Environment:Microsoft DeepSpeedExamples ZeRO Inference Runtime

From Leeroopedia


Knowledge Sources
Domains: Inference, Infrastructure, Deep_Learning
Last Updated: 2026-02-07 13:00 GMT

Overview

A Linux environment with DeepSpeed >= 0.10.3, a custom Transformers fork for KV cache offloading, and an NVIDIA GPU, used to run ZeRO-Inference on 175B+ parameter models with weight quantization and CPU/NVMe offloading.

Description

This environment enables inference of extremely large language models (OPT-175B, BLOOM-176B, LLaMA-2-70B) on limited GPU hardware by leveraging ZeRO Stage 3 parameter offloading to CPU RAM or NVMe storage. It supports INT4 weight quantization to reduce memory footprint by 4x and KV cache offloading to CPU or NVMe for long-sequence generation. The environment requires a custom fork of HuggingFace Transformers with KV cache offloading patches.
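The ZeRO Stage 3 offloading described above is driven by a DeepSpeed config. The following is a minimal sketch: the key names follow the DeepSpeed ZeRO-3 config schema, but the concrete values (batch size, pinned memory, the commented-out NVMe path) are placeholder assumptions, not the repository's exact settings.

```python
# Minimal ZeRO-3 inference config sketch with parameter offload.
# Values are illustrative assumptions; see the DeepSpeed config docs
# for the full schema.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",      # set to "nvme" to spill weights to SSD instead
            "pin_memory": True,   # faster host<->device transfers, pins host RAM
            # "nvme_path": "/local_nvme",  # required when device == "nvme"
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # acts as the inference batch size
}
```

Switching `device` between `"cpu"` and `"nvme"` is what trades host RAM for SSD bandwidth in the scenarios listed under System Requirements below.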

Usage

Use this environment for any large model inference workflow that requires running models exceeding available GPU VRAM. It is the mandatory prerequisite for the Launch_Scripts_ZeRO_Inference, Get_Model_Config, Get_DS_Model, Run_Generation, and Write_Benchmark_Log implementations.

System Requirements

Category | Requirement | Notes
OS | Linux | Ubuntu 20.04+ recommended
GPU | NVIDIA GPU | Tested on A6000 (48GB VRAM); minimum 16GB recommended
CPU RAM | 252GB+ | Required for OPT-175B/BLOOM-176B; scales with model size
NVMe | SSD CS3040 or similar | 5600+ MB/s sequential reads; required only for NVMe offloading
Disk | 500GB+ SSD | For model weight storage (175B models are ~350GB in fp16)
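The disk figure in the table follows directly from parameter count times bytes per parameter. A small helper makes the arithmetic explicit (the function name is mine, for illustration):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

# 175B parameters in fp16 (16 bits) -> 350 GB, matching the table above.
fp16_gb = weight_memory_gb(175e9, 16)

# The same model after INT4 quantization -> 87.5 GB, the 4x reduction
# mentioned in the Description.
int4_gb = weight_memory_gb(175e9, 4)
```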

Dependencies

System Packages

  • CUDA Toolkit (11.x or 12.x)
  • `libaio-dev` (for NVMe offloading with DeepSpeed AIO)

Python Packages

  • `deepspeed>=0.10.3` (requirements.txt lists `>=0.10.1`, but the runtime asserts 0.10.3+; see Code Evidence)
  • `torch`
  • `transformers` (custom fork: `git+https://github.com/tjruwase/transformers@kvcache-offload-cpu`)
  • `packaging`
  • `accelerate`

Credentials

No specific API credentials required for public models. For gated models (e.g., Llama-2), set:

  • `HF_TOKEN`: HuggingFace API token with read access to gated model repositories.

Quick Install

# Install core packages
pip install "deepspeed>=0.10.3" torch packaging accelerate

# Install custom transformers fork with KV cache offloading
pip install git+https://github.com/tjruwase/transformers@kvcache-offload-cpu

Code Evidence

Hard version assertion from `inference/huggingface/zero_inference/run_model.py:28`:

assert version.parse(deepspeed.__version__) >= version.parse("0.10.3"), \
    "ZeRO-Inference with weight quantization and kv cache offloading is available only in DeepSpeed 0.10.3+, please upgrade DeepSpeed"

Requirements from `inference/huggingface/zero_inference/requirements.txt`:

deepspeed>=0.10.1
torch
transformers @ git+https://github.com/tjruwase/transformers@kvcache-offload-cpu
packaging
accelerate

ZeRO-3 prefetch configuration from `run_model.py` (dynamically computed):

"stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size

Common Errors

Error Message | Cause | Solution
`AssertionError: ZeRO-Inference with weight quantization...` | DeepSpeed version < 0.10.3 | `pip install "deepspeed>=0.10.3"`
`CUDA out of memory` | Batch size too large for VRAM | Reduce batch size; enable weight quantization with `--dtype int4`
`CPU out of memory` | Host RAM insufficient for model | Reduce batch size or enable NVMe offloading instead of CPU
`ImportError: kv_cache_offload` | Wrong transformers fork installed | Install the custom fork: `pip install git+https://github.com/tjruwase/transformers@kvcache-offload-cpu`

Compatibility Notes

  • Supported Models: OPT (1.3B-175B), BLOOM (7.1B-176B), LLaMA-2 (7B-70B), Mixtral
  • Weight Quantization: INT4 with configurable group_size (default 64); asymmetric quantization
  • KV Cache Offloading: CPU or NVMe; requires custom transformers fork
  • NVMe Offloading: Requires DeepSpeed compiled with AIO support (`libaio-dev`)
  • Pinned Memory: Speeds up offload transfers but limits maximum batch size
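The asymmetric group-wise INT4 scheme noted above can be illustrated with a short NumPy round trip. This is a conceptual sketch only, not DeepSpeed's actual quantization kernel; function names and the error-bound comment are mine.

```python
import numpy as np

def quantize_int4_asym(w: np.ndarray, group_size: int = 64):
    """Asymmetric per-group INT4 quantization (illustrative sketch)."""
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0                  # 4-bit codes span 0..15
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_int4(q, scale, lo, shape):
    # Reconstruction error per value is at most half a quantization step.
    return (q.astype(np.float32) * scale + lo).reshape(shape)
```

With `group_size=64` (the default noted above), each group stores its own `scale`/`lo` pair, so outliers in one group do not widen the quantization step for the rest of the tensor.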
