Workflow:Microsoft DeepSpeedExamples ZeRO Inference

Knowledge Sources	DeepSpeedExamples, DeepSpeed Docs
Domains	LLMs, Inference, Memory_Optimization
Last Updated	2026-02-07 13:00 GMT

Overview

End-to-end process for running inference on very large language models (up to 175B+ parameters) with limited GPU memory, using ZeRO Stage 3 with hierarchical offloading to CPU memory and NVMe storage, combined with 4-bit weight quantization.

Description

This workflow enables inference of very large language models that would not fit in GPU memory under normal circumstances. It leverages DeepSpeed ZeRO Stage 3 to partition model weights across available memory tiers (GPU, CPU DRAM, NVMe storage) and applies optional 4-bit weight quantization to further reduce memory footprint and PCIe transfer overhead.

Goal: Run text generation inference on models like BLOOM-176B, OPT-175B, or LLaMA-2-70B using a single GPU or a small number of GPUs that would normally be insufficient.

Scope: Covers model configuration, DeepSpeed initialization with offloading, optional weight quantization, KV cache offloading, and throughput-oriented batch generation with performance measurement.

Strategy: Uses ZeRO Stage 3 to automatically partition and offload model parameters. Combines 4-bit weight quantization (weights roughly 4x smaller than FP16) with optional KV cache offloading to CPU. Together these optimizations are reported to give up to 20x higher throughput than a baseline without them.
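
The memory figures behind this strategy can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 96-layer / 12288-hidden shape is an assumed GPT-3-scale configuration, and the weight estimate ignores the small overhead of quantization scales.

```python
GB = 1e9  # decimal gigabytes


def weight_bytes(num_params, bits_per_weight):
    """Bytes needed for the weights alone (quantization scales not counted)."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(num_layers, hidden_size, batch_size, seq_len, bytes_per_elem=2):
    """Bytes for the key and value tensors of every layer.

    Assumes standard multi-head attention, where keys and values each
    hold hidden_size elements per token per layer.
    """
    return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem


params = 175_000_000_000  # assumed parameter count for a 175B model
print(f"FP16 weights : {weight_bytes(params, 16) / GB:.1f} GB")
print(f"4-bit weights: {weight_bytes(params, 4) / GB:.1f} GB")

# Assumed shape: 96 layers, hidden size 12288, FP16 cache entries.
per_token = kv_cache_bytes(96, 12288, batch_size=1, seq_len=1)
print(f"KV cache per token: {per_token / 1e6:.2f} MB")
```

Even at 4 bits the weights of a 175B model exceed a 48 GB GPU, which is why quantization is combined with offloading rather than replacing it. The per-token KV cache cost grows with batch size and sequence length, which motivates moving the cache to CPU memory for large batches.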

Usage

Execute this workflow when you need to run inference on a language model that exceeds available GPU memory. It suits throughput-oriented batch inference where GPU resources are limited (e.g., a single A6000 with 48 GB of VRAM) but the model has tens to hundreds of billions of parameters. It is not well suited to latency-sensitive online serving, because offloaded weights must be streamed to the GPU over PCIe during every forward pass.

Execution Steps

Step 1: Environment Configuration

Set up the DeepSpeed inference environment with the required dependencies and configuration files for ZeRO Stage 3 offloading.

Key considerations:

Install DeepSpeed >= 0.10.3 with appropriate backends
For NVMe offloading, configure the NVMe path and I/O parameters (aio or GDS)
For KV cache offloading, install the custom Transformers fork with kvcache-offload support
Pin CPU memory for faster GPU-CPU transfers when available
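
A quick way to confirm the version requirement above is to compare the installed package version against the minimum. This is a minimal sketch using only the standard library; the helper names are hypothetical and the parser deliberately handles only simple numeric versions.

```python
import re
from importlib import metadata

MIN_DEEPSPEED = (0, 10, 3)


def parse_version(text):
    """Turn '0.10.3' or '0.12.6+cu118' into a tuple of ints for comparison."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:  # stop at suffixes such as 'dev0'
            break
        parts.append(int(match.group()))
    return tuple(parts)


def deepspeed_is_recent_enough(minimum=MIN_DEEPSPEED):
    """Return True if an installed DeepSpeed meets the minimum version."""
    try:
        installed = metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= minimum


if not deepspeed_is_recent_enough():
    print("DeepSpeed >= 0.10.3 not found; install or upgrade it first.")
```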

Step 2: Model Configuration

Load the model configuration and tokenizer without loading full weights. For very large models, use meta-tensor initialization to avoid memory spikes during setup.

What happens:

Load tokenizer with appropriate padding and special token configuration
Retrieve model configuration (architecture, layer count, hidden dimensions)
For OPT-175B and similar models, handle the special case of distributed checkpoint loading
Use accelerate's init_empty_weights for memory-efficient model skeleton creation
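
The steps above can be sketched as follows, assuming the Hugging Face transformers and accelerate packages are installed. The function names are hypothetical, and the special handling for OPT-175B checkpoints is not shown. Imports are placed inside the functions so the snippet can be loaded without the heavy dependencies.

```python
def load_config_and_tokenizer(model_name):
    """Fetch only the config and tokenizer; no model weights are loaded."""
    from transformers import AutoConfig, AutoTokenizer

    config = AutoConfig.from_pretrained(model_name)
    # Left padding keeps the generated tokens aligned at the end of each
    # sequence when prompts in a batch have different lengths.
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return config, tokenizer


def build_empty_model(config):
    """Create the model skeleton on the meta device (no real memory used)."""
    from accelerate import init_empty_weights
    from transformers import AutoModelForCausalLM

    with init_empty_weights():
        model = AutoModelForCausalLM.from_config(config)
    return model
```

A call such as load_config_and_tokenizer("facebook/opt-1.3b") is a reasonable way to try this on a small model before moving to a larger checkpoint; the model name here is only an example.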

Step 3: DeepSpeed Initialization

Initialize the DeepSpeed engine with ZeRO Stage 3 configuration, quantization settings, and offloading parameters.

What happens:

Build DeepSpeed config specifying ZeRO Stage 3 with appropriate bucket sizes and persistence thresholds
Configure offloading destination (CPU or NVMe) with optional pin_memory
Apply weight quantization config (4-bit or 8-bit with configurable group size)
Create the HfDeepSpeedConfig context for proper HuggingFace model integration
Initialize the model with DeepSpeed, which handles weight partitioning and offloading automatically
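
A configuration along these lines could be assembled as a plain dictionary. This is a sketch, not the repository's exact code: build_ds_config and its arguments are hypothetical, the bucket-size heuristics are assumptions, and the key names under "weight_quantization" reflect my understanding of the DeepSpeed schema and should be checked against the installed version.

```python
def build_ds_config(hidden_size, batch_size, offload_device="cpu",
                    nvme_path=None, pin_memory=True,
                    quant_bits=None, quant_group_size=64):
    """Assemble a ZeRO Stage 3 inference config with parameter offloading."""
    if offload_device not in ("cpu", "nvme"):
        raise ValueError("offload_device must be 'cpu' or 'nvme'")
    if quant_bits not in (None, 4, 8):
        raise ValueError("quant_bits must be None, 4, or 8")

    offload = {"device": offload_device, "pin_memory": pin_memory}
    config = {
        "fp16": {"enabled": True},
        "train_micro_batch_size_per_gpu": batch_size,
        "zero_optimization": {
            "stage": 3,
            # Assumed heuristics: size buckets relative to one layer's weights.
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
            "offload_param": offload,
        },
    }

    if offload_device == "nvme":
        if nvme_path is None:
            raise ValueError("nvme_path is required for NVMe offloading")
        offload["nvme_path"] = nvme_path
        # Async I/O settings; the values shown are illustrative.
        config["aio"] = {
            "block_size": 1048576,
            "queue_depth": 8,
            "thread_count": 1,
            "single_submit": False,
            "overlap_events": True,
        }

    if quant_bits is not None:
        # Assumed schema for quantizing weights at initialization time.
        config["weight_quantization"] = {
            "quantized_initialization": {
                "num_bits": quant_bits,
                "group_size": quant_group_size,
                "group_dim": 1,
                "symmetric": False,
            }
        }
    return config
```

In a typical Hugging Face integration, the resulting dictionary would be passed to HfDeepSpeedConfig before the model object is created, and then to deepspeed.initialize along with the model.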

Step 4: Inference Execution

Run batch text generation with optional KV cache offloading and performance instrumentation.

What happens:

Encode input prompts into token sequences
Optionally enable KV cache offloading to CPU memory
Add timing hooks to measure prefill and decode phases separately
Execute model.generate() with configured generation parameters (max tokens, batch size)
Run multiple iterations for stable throughput measurement
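
The idea behind the timing hooks can be shown with a small wrapper that records the duration of each forward call. During generation the first call processes the whole prompt (prefill) and each later call produces one token (decode). PhaseTimer is a hypothetical, framework-free stand-in; with PyTorch the same bookkeeping would normally live in forward pre-hooks and forward hooks on the model, with a GPU synchronization before each clock read.

```python
import time


class PhaseTimer:
    """Wrap a forward function and record how long each call takes."""

    def __init__(self, forward_fn, clock=time.perf_counter):
        self.forward_fn = forward_fn
        self.clock = clock
        self.durations = []

    def __call__(self, *args, **kwargs):
        start = self.clock()
        result = self.forward_fn(*args, **kwargs)
        self.durations.append(self.clock() - start)
        return result

    @property
    def prefill_latency(self):
        """Duration of the first forward call (prompt processing)."""
        return self.durations[0] if self.durations else 0.0

    @property
    def decode_latency(self):
        """Total duration of all forward calls after the first."""
        return sum(self.durations[1:])

    def reset(self):
        """Clear recorded timings before the next generation run."""
        self.durations.clear()


# Usage with a dummy forward function standing in for the model.
timed_forward = PhaseTimer(lambda tokens: len(tokens))
timed_forward([1, 2, 3, 4])  # prefill over the full prompt
timed_forward([5])           # one decode step
timed_forward([6])           # another decode step
print(len(timed_forward.durations), "forward calls recorded")
```

Calling reset() between iterations keeps warm-up runs from skewing the averaged measurements.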

Step 5: Performance Measurement

Collect and log performance metrics including throughput, latency, and memory usage.

What happens:

Measure prefill latency (first token) and decode latency (subsequent tokens) separately
Calculate tokens per second throughput for both phases
Record peak GPU memory usage
Log results with model size, quantization settings, and hardware configuration
Compare against baseline (no quantization, no offloading) if configured
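
Given the two latencies, the throughput figures follow directly. The sketch below uses a hypothetical throughput_report helper and assumes every sequence in the batch has the same prompt length and generates the same number of tokens, with the first generated token attributed to prefill.

```python
def _rate(tokens, seconds):
    """Tokens per second, or 0.0 when the phase took no measurable time."""
    return tokens / seconds if seconds > 0 else 0.0


def throughput_report(batch_size, prompt_len, gen_len,
                      prefill_latency, decode_latency):
    """Compute per-phase and end-to-end throughput in tokens per second."""
    total_latency = prefill_latency + decode_latency
    return {
        "prefill_latency_s": prefill_latency,
        "decode_latency_s": decode_latency,
        "total_latency_s": total_latency,
        # Prefill processes every prompt token of every sequence.
        "prefill_tokens_per_s": _rate(batch_size * prompt_len, prefill_latency),
        # Decode produces the remaining gen_len - 1 tokens per sequence.
        "decode_tokens_per_s": _rate(batch_size * (gen_len - 1), decode_latency),
        # End-to-end rate counts generated tokens only.
        "total_tokens_per_s": _rate(batch_size * gen_len, total_latency),
    }


report = throughput_report(batch_size=8, prompt_len=512, gen_len=32,
                           prefill_latency=2.0, decode_latency=31.0)
for name, value in report.items():
    print(f"{name}: {value:.2f}")
```

Peak GPU memory is not computed here; with PyTorch it can be read from torch.cuda.max_memory_allocated() after the run and logged alongside these numbers.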
