Principle: Microsoft DeepSpeedExamples ZeRO Inference Execution
Sources
- Doc: HuggingFace Generation Strategies -- huggingface.co/docs/transformers/generation_strategies
- Blog: ZeRO-Inference -- deepspeed.ai/2022/09/09/zero-inference
Domains
- NLP
- Inference
- Performance
Overview
A technique for executing text generation on ZeRO-partitioned models with optional KV cache offloading for memory efficiency.
Description
Once a model is ZeRO-initialized (parameters partitioned across GPUs with optional CPU/NVMe offloading), inference uses HuggingFace's `model.generate()` API. The DeepSpeed ZeRO Stage 3 wrapper handles parameter gathering across GPUs transparently -- the model code itself is unaware that parameters are distributed. This allows standard HuggingFace generation strategies (greedy, beam search, sampling) to work without modification.
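As a rough illustration, a ZeRO Stage 3 inference config in this spirit might look like the following (a sketch, not the exact config shipped in DeepSpeedExamples; `offload_param` to CPU is optional and enables the CPU-offload mode described above):

```json
{
  "fp16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    }
  },
  "train_micro_batch_size_per_gpu": 8
}
```

With such a config, the model is wrapped via `deepspeed.initialize()` and then driven through the ordinary `model.generate()` call.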
Execution Flow
The inference execution follows this sequence:
- Prompt encoding: Input text is tokenized and padded to a fixed prompt length using `tokenizer.batch_encode_plus()` with `max_length` padding.
- KV cache offloading setup: If enabled, `model.set_kv_cache_offload(True, gen_len, pin_kv_cache, async_kv_offload)` configures the attention layers to move KV cache tensors to CPU memory after each layer's computation.
- Timing hooks registration: Forward pre-hooks and post-hooks are registered on the model to measure prefill latency separately from decode latency.
- Model stage management: The model's `stage` attribute is set to `"prefill"` before generation. The end-time hook switches it to `"decode"` after the first forward pass completes, enabling separate timing of the two phases.
- Generation: `model.generate(**input_tokens, max_new_tokens=gen_len, do_sample=False)` performs greedy decoding with the ZeRO-partitioned model.
- Result collection: Output token IDs are collected, and timing costs are recorded from GPU-synchronized timers.
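The stage-flipping hook mechanism above can be mimicked framework-free. The toy `ToyModel` and `generate` below are illustrative stand-ins (not the actual DeepSpeedExamples code): the first forward pass is timed as prefill, after which the `stage` attribute flips to decode, exactly as the pre-/post-hooks do:

```python
import time

class ToyModel:
    """Minimal stand-in for a generative model with a `stage` attribute."""
    def __init__(self):
        self.stage = "prefill"
        self.prefill_start = None
        self.prefill_end = None

    def forward(self, tokens):
        # Forward pre-hook analogue: record the start of the first pass.
        if self.stage == "prefill" and self.prefill_start is None:
            self.prefill_start = time.perf_counter()
        next_token = sum(tokens) % 100  # dummy "argmax over logits"
        # Forward post-hook analogue: after the first pass, flip to decode.
        if self.stage == "prefill":
            self.prefill_end = time.perf_counter()
            self.stage = "decode"
        return next_token

def generate(model, prompt, gen_len):
    """Greedy loop: one prefill pass over the prompt, then gen_len - 1 decode passes."""
    tokens = list(prompt)
    for _ in range(gen_len):
        tokens.append(model.forward(tokens))
    return tokens

model = ToyModel()
out = generate(model, prompt=[1, 2, 3], gen_len=4)
assert len(out) == 7 and model.stage == "decode"
```

The real hooks do the same bookkeeping on CUDA-synchronized timers, so prefill and decode latency can be reported separately.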
Two-Phase Generation
Text generation with autoregressive language models consists of two computationally distinct phases:
Phase 1: Prefill
- Processes the entire prompt in parallel (all prompt tokens attend to each other).
- Computes and caches key-value (KV) pairs for all prompt positions.
- Compute-bound: proportional to `prompt_length * hidden_size^2`.
- Measured separately via the `start_time_hook`/`end_time_hook` pair on the model.
Phase 2: Decode
- Generates tokens one at a time (autoregressive).
- Each new token attends to all previous tokens via the KV cache.
- Memory-bound: dominated by loading model parameters and reading the KV cache for each token.
- The KV cache grows with each generated token.
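The two phases and the cache they share can be sketched in pure Python (real implementations store per-layer key/value tensors, not lists; the "attention" here is a placeholder):

```python
def prefill(prompt, num_layers):
    """Phase 1: process all prompt tokens at once; build the KV cache."""
    kv_cache = {layer: [(tok, tok) for tok in prompt]  # (key, value) per position
                for layer in range(num_layers)}
    first_token = max(prompt)  # dummy next-token choice
    return first_token, kv_cache

def decode_step(token, kv_cache):
    """Phase 2: attend to all cached positions, then append this token's KV."""
    for layer, cache in kv_cache.items():
        _ = len(cache)            # stand-in for attention over cached K/V
        cache.append((token, token))
    return (token + 1) % 50       # dummy next token

prompt = [3, 1, 4]
token, cache = prefill(prompt, num_layers=2)
for _ in range(4):
    token = decode_step(token, cache)
# Cache length per layer = prompt_len + number of decode steps.
assert len(cache[0]) == len(prompt) + 4
```

The per-layer append on every decode step is exactly the growth that KV cache offloading targets.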
KV Cache Offloading
The KV cache stores the key and value projections for all previous tokens at every attention layer. For large models with long sequences, this consumes significant GPU memory:
KV_cache_size = 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element
KV cache offloading moves these tensors to CPU memory after each layer's attention computation, freeing GPU HBM for larger batch sizes. The trade-off is additional PCIe transfer overhead per decode step. Two optimizations mitigate this:
- Pinned KV cache (`pin_kv_cache`): Allocates the CPU-side KV cache in pinned (page-locked) memory for faster PCIe transfers.
- Async KV offload (`async_kv_offload`): Uses non-blocking CUDA memory copies to overlap KV cache transfer with computation.
Theoretical Basis
Prefill Throughput
Prefill_throughput = batch_size * prompt_len / prefill_latency
Prefill is compute-bound; throughput scales with GPU FLOPS and batch size until memory limits are reached.
Decode Throughput
Decode_throughput = batch_size * (gen_len - 1) / decode_latency
Decode is memory-bandwidth-bound; each token generation requires loading model parameters (from GPU, CPU, or NVMe) and reading/writing the KV cache.
Total Throughput
Total_throughput = batch_size * gen_len / total_latency
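The three formulas can be wrapped into small helpers (the numbers below are purely illustrative, not measurements; latencies are in seconds):

```python
def prefill_throughput(batch_size, prompt_len, prefill_latency):
    return batch_size * prompt_len / prefill_latency    # prompt tokens/s

def decode_throughput(batch_size, gen_len, decode_latency):
    return batch_size * (gen_len - 1) / decode_latency  # generated tokens/s

def total_throughput(batch_size, gen_len, total_latency):
    return batch_size * gen_len / total_latency         # tokens/s end to end

# Illustrative numbers:
print(prefill_throughput(8, 512, 2.0))   # 2048.0
print(decode_throughput(8, 32, 10.0))    # 24.8
print(total_throughput(8, 32, 12.0))     # ~21.3
```

Note the `gen_len - 1` in decode throughput: the first generated token is produced by the prefill pass, so only the remaining tokens count against decode latency.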
KV Cache Memory
For FP16 models:
KV_cache_bytes = 2 * num_layers * hidden_size * (prompt_len + gen_len) * batch_size * 2
The factor of 2 at the start accounts for both keys and values; the factor of 2 at the end accounts for FP16 (2 bytes per element).
Example: For OPT-175B (num_layers=96, hidden_size=12288) with batch_size=8, prompt_len=512, gen_len=32:
KV_cache = 2 * 96 * 12288 * 544 * 8 * 2 = 20,535,312,384 bytes ≈ 20.5 GB
This demonstrates why KV cache offloading is critical for large-batch inference with massive models.
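The formula is easy to check numerically (pure Python; GB here is decimal, 1e9 bytes):

```python
def kv_cache_bytes(num_layers, hidden_size, prompt_len, gen_len,
                   batch_size, bytes_per_element=2):
    """2 (keys and values) * layers * hidden * total seq len * batch * dtype bytes."""
    seq_len = prompt_len + gen_len
    return 2 * num_layers * hidden_size * seq_len * batch_size * bytes_per_element

# OPT-175B example from the text:
b = kv_cache_bytes(num_layers=96, hidden_size=12288,
                   prompt_len=512, gen_len=32, batch_size=8)
print(b / 1e9)  # ≈ 20.5 GB
```

Doubling the batch size doubles this figure, which is why offloading the cache to CPU memory directly unlocks larger batches.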
Throughput-Latency Trade-off
| Configuration | Latency per Token | Max Batch Size | Throughput |
|---|---|---|---|
| GPU-only (no offload) | Lowest | Smallest | Low (limited by batch size) |
| CPU offload | Medium | Medium | Medium |
| CPU offload + KV offload | Higher per token | Largest | Highest (more parallelism) |
| CPU offload + quantization | Medium | Large | High |
| CPU offload + quantization + KV offload | Higher per token | Largest | Highest overall |
The optimal throughput configuration depends on the trade-off between per-token latency (increased by offloading) and batch size (increased by freeing GPU memory). For throughput-oriented inference, the combination of weight quantization and KV cache offloading typically achieves the best results.
Generation Parameters
| Parameter | Value | Description |
|---|---|---|
| `max_new_tokens` | `gen_len` (default 32) | Number of tokens to generate per sequence |
| `do_sample` | `False` | Greedy decoding (deterministic, so benchmark runs are reproducible) |
| `padding` | `"max_length"` | Pad all prompts to `prompt_len` for consistent batch shapes |
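Put together, the table's settings correspond to call patterns like the following (variable names are illustrative; `prompt_len` and `gen_len` would come from the benchmark configuration):

```python
prompt_len, gen_len = 512, 32

# Tokenizer call: pad every prompt to the same fixed length.
encode_kwargs = dict(
    return_tensors="pt",
    padding="max_length",
    max_length=prompt_len,
)

# model.generate() call: greedy decoding, fixed number of new tokens.
generate_kwargs = dict(
    max_new_tokens=gen_len,
    do_sample=False,
)

assert generate_kwargs["do_sample"] is False
```

Fixed prompt and generation lengths keep every batch the same shape, which makes the prefill/decode timing comparable across runs.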