Workflow:Microsoft DeepSpeedExamples ZeRO Inference
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, Memory_Optimization |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
End-to-end process for running inference on very large language models (up to 175B+ parameters) with limited GPU memory, using ZeRO Stage 3 with hierarchical offloading to CPU and NVMe, optionally combined with 4-bit weight quantization.
Description
This workflow enables inference of very large language models that would not fit in GPU memory under normal circumstances. It leverages DeepSpeed ZeRO Stage 3 to partition model weights across available memory tiers (GPU, CPU DRAM, NVMe storage) and applies optional 4-bit weight quantization to further reduce memory footprint and PCIe transfer overhead.
Goal: Run text generation inference on models like BLOOM-176B, OPT-175B, or LLaMA-2-70B using a single GPU or a small number of GPUs that would normally be insufficient.
Scope: Covers model configuration, DeepSpeed initialization with offloading, optional weight quantization, KV cache offloading, and throughput-oriented batch generation with performance measurement.
Strategy: Uses ZeRO Stage 3 to automatically partition and offload model parameters. Combines 4-bit weight quantization (reducing weight memory by roughly 4x relative to FP16) with optional KV cache offloading to CPU. Together these are reported to give up to 20x higher throughput than a baseline without quantization and KV cache offloading.
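The ~4x figure follows from simple arithmetic on bits per weight. The sketch below is illustrative only: the helper name is hypothetical, the parameter counts are the nominal model sizes, and the estimate ignores activations, the KV cache, and quantization metadata such as per-group scales.

```python
# Back-of-the-envelope weight memory at different precisions.
# Real usage is higher: activations, KV cache, and quantization
# metadata (scales / zero points) are not counted here.
GIB = 1024 ** 3


def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Return the size of the model weights in GiB at a given precision."""
    return num_params * bits_per_weight / 8 / GIB


# Nominal parameter counts taken from the model names.
for name, num_params in [("OPT-175B", 175e9), ("BLOOM-176B", 176e9), ("LLaMA-2-70B", 70e9)]:
    fp16 = weight_footprint_gib(num_params, 16)
    int4 = weight_footprint_gib(num_params, 4)
    print(f"{name}: FP16 ~{fp16:.0f} GiB, 4-bit ~{int4:.0f} GiB")
```

Even at 4 bits, the weights of a 175B-parameter model exceed the 48 GB of a single A6000, which is why quantization is combined with offloading rather than replacing it.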
Usage
Execute this workflow when you need to run inference on a language model that exceeds available GPU memory. It suits throughput-oriented batch inference where GPU resources are limited (e.g., a single A6000 with 48GB VRAM) but the model has tens to hundreds of billions of parameters. It is not ideal for latency-sensitive online serving, because offloaded weights must be streamed to the GPU over PCIe on every forward pass.
Execution Steps
Step 1: Environment Configuration
Set up the DeepSpeed inference environment with the required dependencies and configuration files for ZeRO Stage 3 offloading.
Key considerations:
- Install DeepSpeed >= 0.10.3; NVMe offloading additionally requires DeepSpeed's async I/O (aio) op to be available on the system
- For NVMe offloading, configure the NVMe path and I/O parameters (aio or GDS)
- For KV cache offloading, install the custom Transformers fork with kvcache-offload support
- Pin CPU memory for faster GPU-CPU transfers when enough host memory is available
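The offloading choices above map onto the `offload_param` and `aio` sections of a DeepSpeed config. The following is a minimal sketch, not the workflow's actual script: `build_offload_config` is a hypothetical helper, the `aio` values are DeepSpeed's documented defaults used as placeholders, and GDS-specific settings are not shown.

```python
from typing import Any, Dict, Optional


def build_offload_config(device: str = "cpu",
                         pin_memory: bool = True,
                         nvme_path: Optional[str] = None) -> Dict[str, Any]:
    """Return the offload-related fragment of a DeepSpeed ZeRO Stage 3 config."""
    if device not in ("cpu", "nvme"):
        raise ValueError(f"unsupported offload device: {device!r}")

    offload_param: Dict[str, Any] = {"device": device, "pin_memory": pin_memory}
    fragment: Dict[str, Any] = {
        "zero_optimization": {"stage": 3, "offload_param": offload_param},
    }

    if device == "nvme":
        if not nvme_path:
            raise ValueError("nvme_path is required when device='nvme'")
        offload_param["nvme_path"] = nvme_path
        # Placeholder I/O parameters (DeepSpeed defaults); tune for the drive.
        fragment["aio"] = {
            "block_size": 1048576,
            "queue_depth": 8,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True,
        }
    return fragment
```

For example, `build_offload_config("nvme", nvme_path="/local_nvme")` (the path is illustrative) yields a fragment that can be merged into the full config built in Step 3.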
Step 2: Model Configuration
Load the model configuration and tokenizer without loading full weights. For very large models, use meta-tensor initialization to avoid memory spikes during setup.
What happens:
- Load tokenizer with appropriate padding and special token configuration
- Retrieve model configuration (architecture, layer count, hidden dimensions)
- For OPT-175B and similar models, handle the special case of distributed checkpoint loading
- Use accelerate's init_empty_weights for memory-efficient model skeleton creation
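A rough sketch of this step, under stated assumptions: `load_model_skeleton` and `estimate_decoder_params` are hypothetical helpers, left padding and reusing the EOS token as the pad token are common choices for batched decoder-only generation rather than something the workflow mandates, and the parameter estimate uses the usual 12 * layers * hidden^2 approximation, which ignores biases, layer norms, and positional embeddings.

```python
def estimate_decoder_params(num_layers: int, hidden_size: int, vocab_size: int) -> int:
    """Approximate parameter count of a decoder-only transformer.

    Each layer contributes about 12 * hidden_size^2 weights (attention plus
    a 4x-wide MLP); the token embedding adds vocab_size * hidden_size.
    """
    return 12 * num_layers * hidden_size ** 2 + vocab_size * hidden_size


def load_model_skeleton(model_name: str):
    """Load tokenizer and config, then build a weightless model on the meta device."""
    # Imported lazily so the arithmetic helper above works without these packages.
    from accelerate import init_empty_weights
    from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    if tokenizer.pad_token is None:
        # Assumption: fall back to EOS when the tokenizer defines no pad token.
        tokenizer.pad_token = tokenizer.eos_token

    config = AutoConfig.from_pretrained(model_name)
    with init_empty_weights():
        # Parameters are created as meta tensors: shapes only, no storage.
        model = AutoModelForCausalLM.from_config(config)
    return tokenizer, config, model


# Commonly cited OPT-175B shape: 96 layers, hidden size 12288, vocab 50272.
print(f"{estimate_decoder_params(96, 12288, 50272) / 1e9:.1f}B parameters")
```

The estimate shows why the skeleton matters: materializing roughly 175 billion FP16 weights during setup would need about 350 GB of host memory before any partitioning takes place.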
Step 3: DeepSpeed Initialization
Initialize the DeepSpeed engine with ZeRO Stage 3 configuration, quantization settings, and offloading parameters.
What happens:
- Build DeepSpeed config specifying ZeRO Stage 3 with appropriate bucket sizes and persistence thresholds
- Configure offloading destination (CPU or NVMe) with optional pin_memory
- Apply weight quantization config (4-bit or 8-bit with configurable group size)
- Create the HfDeepSpeedConfig object before instantiating the model, and keep it alive, so that HuggingFace model loading is ZeRO Stage 3 aware
- Initialize the model with DeepSpeed, which handles weight partitioning and offloading automatically
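The initialization sequence can be sketched as follows. This is an illustration under assumptions, not the workflow's exact code: `build_ds_config` and `init_engine` are hypothetical names, sizing the buckets from the hidden size is a heuristic, and the `weight_quantization` schema is assumed to match the one used by the ZeRO-Inference examples, so check it against the installed DeepSpeed version.

```python
from typing import Any, Dict, Optional


def build_ds_config(hidden_size: int,
                    batch_size: int,
                    offload_device: str = "cpu",
                    pin_memory: bool = True,
                    quant_bits: Optional[int] = None,
                    group_size: int = 64) -> Dict[str, Any]:
    """Assemble a ZeRO Stage 3 inference config with parameter offloading."""
    config: Dict[str, Any] = {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            # Heuristic sizing tied to the model's hidden dimension.
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
            "offload_param": {"device": offload_device, "pin_memory": pin_memory},
        },
        "train_micro_batch_size_per_gpu": batch_size,
    }
    if quant_bits is not None:
        # Assumed schema for quantized initialization of weights.
        config["weight_quantization"] = {
            "quantized_initialization": {
                "num_bits": quant_bits,
                "group_size": group_size,
                "group_dim": 1,
                "symmetric": False,
            }
        }
    return config


def init_engine(model_name: str, ds_config: Dict[str, Any]):
    """Load a HuggingFace model under ZeRO Stage 3 and return it in eval mode."""
    # Lazy imports: these packages are only needed when a model is loaded.
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM
    from transformers.integrations import HfDeepSpeedConfig

    # Must exist before from_pretrained and stay referenced afterwards.
    hf_ds_config = HfDeepSpeedConfig(ds_config)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    engine = deepspeed.initialize(model=model, config=ds_config)[0]
    engine.module.eval()
    return engine.module, hf_ds_config
```

Returning the `HfDeepSpeedConfig` object alongside the model is one way to make sure the caller keeps a reference to it.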
Step 4: Inference Execution
Run batch text generation with optional KV cache offloading and performance instrumentation.
What happens:
- Encode input prompts into token sequences
- Optionally enable KV cache offloading to CPU memory
- Add timing hooks to measure prefill and decode phases separately
- Execute model.generate() with configured generation parameters (max tokens, batch size)
- Run multiple iterations for stable throughput measurement
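One way to separate the two phases is to time every forward pass and treat the first one as prefill. The sketch below assumes that generation runs one forward pass over the full prompt followed by one pass per additional token; `PhaseTimer` is a hypothetical helper, and the injectable clock exists only to make it testable.

```python
import time
from typing import Callable, List, Optional


class PhaseTimer:
    """Record per-forward-pass durations and split them into prefill and decode."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self._start: Optional[float] = None
        self.durations: List[float] = []

    def start(self, *_hook_args) -> None:
        """Call before a forward pass (accepts and ignores hook arguments)."""
        self._start = self._clock()

    def stop(self, *_hook_args) -> None:
        """Call after a forward pass; ignored if start() was never called."""
        if self._start is None:
            return
        self.durations.append(self._clock() - self._start)
        self._start = None

    @property
    def prefill(self) -> float:
        """Duration of the first forward pass (whole prompt)."""
        return self.durations[0] if self.durations else 0.0

    @property
    def decode(self) -> float:
        """Total duration of all later forward passes (one token each)."""
        return sum(self.durations[1:])

    def reset(self) -> None:
        """Clear recorded durations between benchmark iterations."""
        self._start = None
        self.durations.clear()
```

With a PyTorch model, `timer.start` and `timer.stop` could be attached through `model.register_forward_pre_hook` and `model.register_forward_hook`. On a GPU the timings are only meaningful if the device is synchronized before each clock read, which this sketch omits.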
Step 5: Performance Measurement
Collect and log performance metrics including throughput, latency, and memory usage.
What happens:
- Measure prefill latency (first token) and decode latency (subsequent tokens) separately
- Calculate tokens per second throughput for both phases
- Record peak GPU memory usage
- Log results with model size, quantization settings, and hardware configuration
- Compare against baseline (no quantization, no offloading) if configured
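The throughput figures reduce to a few divisions once the phase latencies are known. The sketch below uses one common accounting convention, assumed here rather than taken from the workflow's scripts: prefill processes every prompt token and emits the first generated token, so decode covers the remaining `gen_len - 1` tokens per sequence. `GenerationStats` is a hypothetical container.

```python
from dataclasses import dataclass


@dataclass
class GenerationStats:
    batch_size: int
    prompt_len: int         # prompt tokens per sequence
    gen_len: int            # generated tokens per sequence
    prefill_latency: float  # seconds
    decode_latency: float   # seconds

    @property
    def prefill_throughput(self) -> float:
        """Prompt tokens processed per second during prefill."""
        return self.batch_size * self.prompt_len / self.prefill_latency

    @property
    def decode_throughput(self) -> float:
        """Generated tokens per second during decode (0.0 if no decode steps)."""
        if self.decode_latency == 0:
            return 0.0
        return self.batch_size * (self.gen_len - 1) / self.decode_latency

    @property
    def total_throughput(self) -> float:
        """Generated tokens per second over the whole request."""
        total = self.prefill_latency + self.decode_latency
        return self.batch_size * self.gen_len / total


stats = GenerationStats(batch_size=8, prompt_len=512, gen_len=32,
                        prefill_latency=4.0, decode_latency=31.0)
print(f"prefill {stats.prefill_throughput:.1f} tok/s, "
      f"decode {stats.decode_throughput:.1f} tok/s, "
      f"total {stats.total_throughput:.2f} tok/s")
```

The latencies in the example are made-up inputs, not measurements. Peak GPU memory can be read separately with `torch.cuda.max_memory_allocated()` after the run.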