Workflow:Microsoft DeepSpeedExamples ZeRO Inference

From Leeroopedia


Knowledge Sources
Domains: LLMs, Inference, Memory_Optimization
Last Updated: 2026-02-07 13:00 GMT

Overview

End-to-end process for running inference on very large language models (up to 175B+ parameters) with limited GPU memory by leveraging ZeRO Stage 3 with hierarchical offloading to CPU and NVMe, optionally combined with 4-bit weight quantization.

Description

This workflow enables inference on very large language models whose weights would not otherwise fit in GPU memory. It leverages DeepSpeed ZeRO Stage 3 to partition model weights across the available memory tiers (GPU, CPU DRAM, NVMe storage) and applies optional 4-bit weight quantization to further reduce both the memory footprint and the volume of data moved over PCIe.

Goal: Run text generation inference on models like BLOOM-176B, OPT-175B, or LLaMA-2-70B using a single GPU or a small number of GPUs that would normally be insufficient.

Scope: Covers model configuration, DeepSpeed initialization with offloading, optional weight quantization, KV cache offloading, and throughput-oriented batch generation with performance measurement.

Strategy: Uses ZeRO Stage 3 to automatically partition and offload model parameters. Combines 4-bit weight quantization (shrinking weight storage by roughly 4x relative to FP16) with optional KV cache offloading to CPU. Together these optimizations are reported to deliver up to a 20x throughput improvement over a baseline without weight quantization and KV cache offloading.
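
The roughly 4x reduction follows from simple arithmetic on bits per weight. The sketch below is illustrative only: the helper name weight_footprint_gib and the 175e9 parameter count are assumptions chosen for the example, and the calculation ignores quantization metadata such as per-group scales.

```python
def weight_footprint_gib(num_params: float, bits_per_weight: int) -> float:
    """Return the size of the raw weights in GiB, ignoring quantization metadata."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative parameter count for a 175B-class model (an assumption, not a measurement).
params = 175e9

fp16_gib = weight_footprint_gib(params, 16)
int4_gib = weight_footprint_gib(params, 4)

print(f"FP16 weights:  {fp16_gib:.0f} GiB")
print(f"4-bit weights: {int4_gib:.0f} GiB")
print(f"Reduction:     {fp16_gib / int4_gib:.1f}x")
```

Under these assumptions the weights take about 326 GiB in FP16 and about 81 GiB at 4 bits. Both figures exceed the memory of a single 48 GB GPU, which is why quantization is combined with offloading rather than used in place of it.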

Usage

Execute this workflow when you need to run inference on a language model that exceeds available GPU memory. It suits throughput-oriented batch inference where GPU resources are limited (e.g., a single A6000 with 48 GB of VRAM) but the model has tens or hundreds of billions of parameters. It is not well suited to latency-sensitive online serving, because every forward pass must stream weights in from CPU or NVMe.

Execution Steps

Step 1: Environment Configuration

Set up the DeepSpeed inference environment with the required dependencies and configuration files for ZeRO Stage 3 offloading.

Key considerations:

  • Install DeepSpeed >= 0.10.3 with the backends you need (for example, the async I/O op for NVMe offloading)
  • For NVMe offloading, configure the NVMe path and I/O parameters (libaio-based async I/O or GPUDirect Storage)
  • For KV cache offloading, install the custom Transformers fork with kvcache-offload support
  • Pin CPU memory for faster GPU-CPU transfers when available

Step 2: Model Configuration

Load the model configuration and tokenizer without loading full weights. For very large models, use meta-tensor initialization to avoid memory spikes during setup.

What happens:

  • Load tokenizer with appropriate padding and special token configuration
  • Retrieve model configuration (architecture, layer count, hidden dimensions)
  • For OPT-175B and similar models, handle the special case of distributed checkpoint loading
  • Use accelerate's init_empty_weights for memory-efficient model skeleton creation
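
Because the configuration alone carries the layer count and hidden dimensions, the model size can be estimated before any weights are touched. The following is a rough sketch, assuming a GPT-style decoder with a 4x-wide MLP; the helper name estimate_decoder_params is hypothetical, and the example values (96 layers, hidden size 12288, vocabulary 50272) are illustrative figures for a 175B-class model that in practice would be read from the loaded configuration.

```python
def estimate_decoder_params(num_layers: int, hidden_size: int, vocab_size: int) -> int:
    """Approximate the parameter count of a GPT-style decoder (biases and norms ignored)."""
    # Per layer: ~4*h^2 for the attention projections plus ~8*h^2 for a 4x-wide MLP.
    per_layer = 12 * hidden_size * hidden_size
    # Token embedding matrix (assumed tied with the output projection).
    embeddings = vocab_size * hidden_size
    return num_layers * per_layer + embeddings


# Illustrative values for a 175B-class model; read these from the real config in practice.
n = estimate_decoder_params(num_layers=96, hidden_size=12288, vocab_size=50272)
print(f"~{n / 1e9:.1f}B parameters")
```

An estimate like this is enough to decide between CPU and NVMe offloading before committing to a full checkpoint load.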

Step 3: DeepSpeed Initialization

Initialize the DeepSpeed engine with ZeRO Stage 3 configuration, quantization settings, and offloading parameters.

What happens:

  • Build DeepSpeed config specifying ZeRO Stage 3 with appropriate bucket sizes and persistence thresholds
  • Configure offloading destination (CPU or NVMe) with optional pin_memory
  • Apply weight quantization config (4-bit or 8-bit with configurable group size)
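  • Create the HfDeepSpeedConfig object before loading the model and keep a reference to it alive, so that HuggingFace from_pretrained detects ZeRO Stage 3 and partitions weights as they load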
  • Initialize the model with DeepSpeed, which handles weight partitioning and offloading automatically
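
The configuration described in this step can be sketched as a plain dictionary. This is a minimal sketch, not the workflow's actual code: build_ds_config is a hypothetical helper, the sizing of the bucket and persistence thresholds from hidden_size is an assumed heuristic, and the layout of the weight_quantization block is an assumption that should be checked against the installed DeepSpeed version.

```python
from typing import Optional


def build_ds_config(
    hidden_size: int,
    batch_size: int,
    offload: str = "cpu",              # "cpu" or "nvme"
    nvme_path: Optional[str] = None,
    pin_memory: bool = True,
    quant_bits: Optional[int] = None,  # e.g. 4 or 8; None disables quantization
    group_size: int = 64,
) -> dict:
    """Build a ZeRO Stage 3 inference config dict (hypothetical helper)."""
    config = {
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            # Assumed heuristic: scale thresholds with the model's hidden size.
            "stage3_prefetch_bucket_size": 2 * hidden_size * hidden_size,
            "stage3_param_persistence_threshold": hidden_size,
            "stage3_max_live_parameters": 2 * hidden_size * hidden_size,
        },
        "train_micro_batch_size_per_gpu": batch_size,
        "wall_clock_breakdown": False,
    }

    if offload == "cpu":
        offload_param = {"device": "cpu", "pin_memory": pin_memory}
    elif offload == "nvme":
        if nvme_path is None:
            raise ValueError("nvme_path is required when offload='nvme'")
        offload_param = {"device": "nvme", "nvme_path": nvme_path, "pin_memory": pin_memory}
    else:
        raise ValueError(f"unknown offload target: {offload!r}")
    config["zero_optimization"]["offload_param"] = offload_param

    if quant_bits is not None:
        # Assumed layout of the quantization block; verify against your DeepSpeed release.
        config["weight_quantization"] = {
            "quantized_initialization": {
                "num_bits": quant_bits,
                "group_size": group_size,
                "group_dim": 1,
                "symmetric": False,
            }
        }
    return config


cfg = build_ds_config(hidden_size=12288, batch_size=8, offload="cpu", quant_bits=4)
print(cfg["zero_optimization"]["offload_param"])
```

In a real run, a dictionary like this would be passed to HfDeepSpeedConfig before the model is loaded and then to deepspeed.initialize alongside the model.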

Step 4: Inference Execution

Run batch text generation with optional KV cache offloading and performance instrumentation.

What happens:

  • Encode input prompts into token sequences
  • Optionally enable KV cache offloading to CPU memory
  • Add timing hooks to measure prefill and decode phases separately
  • Execute model.generate() with configured generation parameters (max tokens, batch size)
  • Run multiple iterations for stable throughput measurement
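
The idea behind the timing hooks can be illustrated without a GPU. The sketch below assumes that, with a KV cache, the first forward call of a generation processes the whole prompt (prefill) and each later call produces one token (decode). PhaseTimer and fake_forward are hypothetical names; a real setup would attach hooks to the model and synchronize the GPU before reading the clock.

```python
import time


class PhaseTimer:
    """Wrap a forward callable and record the wall-clock time of every call."""

    def __init__(self, forward):
        self._forward = forward
        self.call_times = []

    def __call__(self, *args, **kwargs):
        start = time.perf_counter()
        out = self._forward(*args, **kwargs)
        self.call_times.append(time.perf_counter() - start)
        return out

    @property
    def prefill_latency(self) -> float:
        # The first call handles the full prompt.
        return self.call_times[0] if self.call_times else 0.0

    @property
    def decode_latency(self) -> float:
        # Every call after the first generates a single new token.
        return sum(self.call_times[1:])


def fake_forward(tokens):
    # Hypothetical stand-in for a model forward pass: appends one dummy token.
    return tokens + [0]


timed = PhaseTimer(fake_forward)
tokens = [1, 2, 3]
for _ in range(4):
    tokens = timed(tokens)

print(f"forward calls: {len(timed.call_times)}")
print(f"prefill: {timed.prefill_latency:.6f}s, decode: {timed.decode_latency:.6f}s")
```

Splitting the two phases matters because prefill is compute-bound over the whole prompt while decode is dominated by weight and KV cache movement, so the two respond differently to offloading.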

Step 5: Performance Measurement

Collect and log performance metrics including throughput, latency, and memory usage.

What happens:

  • Measure prefill latency (first token) and decode latency (subsequent tokens) separately
  • Calculate tokens per second throughput for both phases
  • Record peak GPU memory usage
  • Log results with model size, quantization settings, and hardware configuration
  • Compare against baseline (no quantization, no offloading) if configured
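
The per-phase throughput figures can be derived from the measured latencies as sketched below. The function name throughput_metrics and the sample numbers are made up for illustration; the formulas assume prefill processes every prompt token once and decode produces all but the first generated token.

```python
def throughput_metrics(
    batch_size: int,
    prompt_len: int,
    gen_len: int,
    prefill_latency: float,
    decode_latency: float,
) -> dict:
    """Compute tokens-per-second throughput for prefill, decode, and the whole run."""
    if prefill_latency <= 0 or decode_latency <= 0:
        raise ValueError("latencies must be positive")
    total_latency = prefill_latency + decode_latency
    return {
        # Prefill processes batch_size * prompt_len prompt tokens.
        "prefill_tokens_per_s": batch_size * prompt_len / prefill_latency,
        # The first generated token comes out of prefill, so decode yields gen_len - 1.
        "decode_tokens_per_s": batch_size * (gen_len - 1) / decode_latency,
        # End-to-end generated tokens per second.
        "total_tokens_per_s": batch_size * gen_len / total_latency,
        "total_latency_s": total_latency,
    }


# Made-up sample measurements, for illustration only.
m = throughput_metrics(
    batch_size=8, prompt_len=512, gen_len=32, prefill_latency=4.0, decode_latency=62.0
)
for name, value in m.items():
    print(f"{name}: {value:.2f}")
```

Reporting the phases separately makes it easier to see whether quantization or KV cache offloading changed the prefill cost, the decode cost, or both.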
