Principle: Microsoft DeepSpeedExamples ZeRO Inference Execution
Sources
- Doc: HuggingFace Generation Strategies -- huggingface.co/docs/transformers/generation_strategies
- Blog: ZeRO-Inference -- deepspeed.ai/2022/09/09/zero-inference
Domains
- NLP
- Inference
- Performance
Overview
A technique for executing text generation on ZeRO-partitioned models with optional KV cache offloading for memory efficiency.
Description
Once a model is ZeRO-initialized (parameters partitioned across GPUs with optional CPU/NVMe offloading), inference uses HuggingFace's `model.generate()` API. The DeepSpeed ZeRO Stage 3 wrapper handles parameter gathering across GPUs transparently -- the model code itself is unaware that parameters are distributed. This allows standard HuggingFace generation strategies (greedy, beam search, sampling) to work without modification.
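As a rough illustration, a ZeRO Stage 3 inference config in this spirit might look like the following (a sketch, not the exact config shipped in DeepSpeedExamples; `offload_param` to CPU is optional and enables the CPU-offload mode described above):

```json
{
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "train_micro_batch_size_per_gpu": 8
}
```

With such a config, the model is wrapped via `deepspeed.initialize()` and then driven through the ordinary `model.generate()` call.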
Execution Flow
The inference execution follows this sequence:
- Prompt encoding: Input text is tokenized and padded to a fixed prompt length using `tokenizer.batch_encode_plus()` with `max_length` padding.
- KV cache offloading setup: If enabled, `model.set_kv_cache_offload(True, gen_len, pin_kv_cache, async_kv_offload)` configures the attention layers to move KV cache tensors to CPU memory after each layer's computation.
- Timing hooks registration: Forward pre-hooks and post-hooks are registered on the model to measure prefill latency separately from decode latency.
- Model stage management: The model's `stage` attribute is set to `"prefill"` before generation. The end-time hook switches it to `"decode"` after the first forward pass completes, enabling separate timing of the two phases.
- Generation: `model.generate(**input_tokens, max_new_tokens=gen_len, do_sample=False)` performs greedy decoding with the ZeRO-partitioned model.
- Result collection: Output token IDs are collected, and timing costs are recorded from GPU-synchronized timers.
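The stage-flipping hook mechanism above can be mimicked framework-free. The toy `ToyModel` and `generate` below are illustrative stand-ins (not the actual DeepSpeedExamples code): the first forward pass is timed as prefill, after which the `stage` attribute flips to decode, exactly as the pre-/post-hooks do:

```python
import time

class ToyModel:
    """Minimal stand-in for a generative model with a `stage` attribute."""
    def __init__(self):
        self.stage = "prefill"
        self.prefill_start = None
        self.prefill_end = None

    def forward(self, tokens):
        # Forward pre-hook analogue: record the start of the first pass.
        if self.stage == "prefill" and self.prefill_start is None:
            self.prefill_start = time.perf_counter()
        next_token = sum(tokens) % 100  # dummy "argmax over logits"
        # Forward post-hook analogue: after the first pass, flip to decode.
        if self.stage == "prefill":
            self.prefill_end = time.perf_counter()
            self.stage = "decode"
        return next_token

def generate(model, prompt, gen_len):
    """Greedy loop: one prefill pass over the prompt, then gen_len - 1 decode passes."""
    tokens = list(prompt)
    for _ in range(gen_len):
        tokens.append(model.forward(tokens))
    return tokens

model = ToyModel()
out = generate(model, prompt=[1, 2, 3], gen_len=4)
assert len(out) == 7 and model.stage == "decode"
```

The real hooks do the same bookkeeping on CUDA-synchronized timers, so prefill and decode latency can be reported separately.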
Two-Phase Generation
Text generation with autoregressive language models consists of two computationally distinct phases:
Phase 1: Prefill
- Processes the entire prompt in parallel (all prompt tokens attend to each other).
- Computes and caches key-value (KV) pairs for all prompt positions.
- Compute-bound: proportional to `prompt_length * hidden_size^2`.
- Measured separately via the `start_time_hook`/`end_time_hook` pair on the model.
Phase 2: Decode
- Generates tokens one at a time (autoregressive).
- Each new token attends to all previous tokens via the KV cache.
- Memory-bound: dominated by loading model parameters and reading the KV cache for each token.
- The KV cache grows with each generated token.
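The two phases and the cache they share can be sketched in pure Python (real implementations store per-layer key/value tensors, not lists; the "attention" here is a placeholder):

```python
def prefill(prompt, num_layers):
    """Phase 1: process all prompt tokens at once; build the KV cache."""
    kv_cache = {layer: [(tok, tok) for tok in prompt]  # (key, value) per position
                for layer in range(num_layers)}
    first_token = max(prompt)  # dummy next-token choice
    return first_token, kv_cache

def decode_step(token, kv_cache):
    """Phase 2: attend to all cached positions, then append this token's KV."""
    for layer, cache in kv_cache.items():
        _ = len(cache)            # stand-in for attention over cached K/V
        cache.append((token, token))
    return (token + 1) % 50       # dummy next token

prompt = [3, 1, 4]
token, cache = prefill(prompt, num_layers=2)
for _ in range(4):
    token = decode_step(token, cache)
# Cache length per layer = prompt_len + number of decode steps.
assert len(cache[0]) == len(prompt) + 4
```

The per-layer append on every decode step is exactly the growth that KV cache offloading targets.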
KV Cache Offloading
The KV cache stores the key and value projections for all previous tokens at every attention layer. For large models with long sequences, this consumes significant GPU memory:
KV_cache_size = 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element
KV cache offloading moves these tensors to CPU memory after each layer's attention computation, freeing GPU HBM for larger batch sizes. The trade-off is additional PCIe transfer overhead per decode step. Two optimizations mitigate this:
- Pinned KV cache (`pin_kv_cache`): Allocates the CPU-side KV cache in pinned (page-locked) memory for faster PCIe transfers.
- Async KV offload (`async_kv_offload`): Uses non-blocking CUDA memory copies to overlap KV cache transfer with computation.
Theoretical Basis
Prefill Throughput
Prefill_throughput = batch_size * prompt_len / prefill_latency
Prefill is compute-bound; throughput scales with GPU FLOPS and batch size until memory limits are reached.
Decode Throughput
Decode_throughput = batch_size * (gen_len - 1) / decode_latency
Decode is memory-bandwidth-bound; each token generation requires loading model parameters (from GPU, CPU, or NVMe) and reading/writing the KV cache.
Total Throughput
Total_throughput = batch_size * gen_len / total_latency
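The three formulas can be wrapped into small helpers (the numbers below are purely illustrative, not measurements; latencies are in seconds):

```python
def prefill_throughput(batch_size, prompt_len, prefill_latency):
    return batch_size * prompt_len / prefill_latency    # prompt tokens/s

def decode_throughput(batch_size, gen_len, decode_latency):
    return batch_size * (gen_len - 1) / decode_latency  # generated tokens/s

def total_throughput(batch_size, gen_len, total_latency):
    return batch_size * gen_len / total_latency         # tokens/s end to end

# Illustrative numbers:
print(prefill_throughput(8, 512, 2.0))   # 2048.0
print(decode_throughput(8, 32, 10.0))    # 24.8
print(total_throughput(8, 32, 12.0))     # ~21.3
```

Note the `gen_len - 1` in decode throughput: the first generated token is produced by the prefill pass, so only the remaining tokens count against decode latency.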
KV Cache Memory
For FP16 models:
KV_cache_bytes = 2 * num_layers * hidden_size * (prompt_len + gen_len) * batch_size * 2
The factor of 2 at the start accounts for both keys and values; the factor of 2 at the end accounts for FP16 (2 bytes per element).
Example: For OPT-175B (num_layers=96, hidden_size=12288) with batch_size=8, prompt_len=512, gen_len=32:
KV_cache = 2 * 96 * 12288 * 544 * 8 * 2 = 20,535,312,384 bytes ≈ 20.5 GB
This demonstrates why KV cache offloading is critical for large-batch inference with massive models.
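The formula is easy to check numerically (pure Python; GB here is decimal, 1e9 bytes):

```python
def kv_cache_bytes(num_layers, hidden_size, prompt_len, gen_len,
                   batch_size, bytes_per_element=2):
    """2 (keys and values) * layers * hidden * total seq len * batch * dtype bytes."""
    seq_len = prompt_len + gen_len
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element

# OPT-175B example from the text:
b = kv_cache_bytes(num_layers=96, hidden_size=12288,
                   prompt_len=512, gen_len=32, batch_size=8)
print(b / 1e9)  # ≈ 20.5 GB
```

Doubling the batch size doubles this figure, which is why offloading the cache to CPU memory directly unlocks larger batches.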
Throughput-Latency Trade-off
| Configuration | Latency per Token | Max Batch Size | Throughput |
|---|---|---|---|
| GPU-only (no offload) | Lowest | Smallest | Low (limited by batch size) |
| CPU offload | Medium | Medium | Medium |
| CPU offload + KV offload | Higher per token | Largest | Highest (more parallelism) |
| CPU offload + quantization | Medium | Large | High |
| CPU offload + quantization + KV offload | Higher per token | Largest | Highest overall |
The optimal throughput configuration depends on the trade-off between per-token latency (increased by offloading) and batch size (increased by freeing GPU memory). For throughput-oriented inference, the combination of weight quantization and KV cache offloading typically achieves the best results.
Generation Parameters
| Parameter | Value | Description |
|---|---|---|
| `max_new_tokens` | `gen_len` (default 32) | Number of tokens to generate per sequence |
| `do_sample` | `False` | Greedy decoding (deterministic, so benchmark runs are reproducible) |
| `padding` | `"max_length"` | Pad all prompts to `prompt_len` for consistent batch shapes |
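Put together, the table's settings correspond to call patterns like the following (variable names are illustrative; `prompt_len` and `gen_len` would come from the benchmark configuration):

```python
prompt_len, gen_len = 512, 32

# Tokenizer call: pad every prompt to the same fixed length.
encode_kwargs = dict(
    return_tensors="pt",
    padding="max_length",
    max_length=prompt_len,
)

# model.generate() call: greedy decoding, fixed number of new tokens.
generate_kwargs = dict(
    max_new_tokens=gen_len,
    do_sample=False,
)

assert generate_kwargs["do_sample"] is False
```

Fixed prompt and generation lengths keep every batch the same shape, which makes the prefill/decode timing comparable across runs.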