Principle: Microsoft DeepSpeedExamples Inference Performance Measurement
Sources
- Blog: ZeRO-Inference: Democratizing massive model inference -- deepspeed.ai/2022/09/09/zero-inference
- Paper: ZeRO-Inference: Democratizing massive model inference -- arXiv:2207.00032
Domains
- Performance
- Benchmarking
- Inference
Overview
A benchmarking methodology for measuring inference performance including prefill latency, decode throughput, and memory utilization.
Description
Performance measurement for LLM inference tracks multiple complementary metrics that together characterize system behavior. The ZeRO-Inference benchmarking framework measures:
- Prefill latency: Time to process the input prompt (all tokens processed in parallel).
- Decode latency: Time to generate all output tokens (autoregressive, one token at a time).
- Total latency: End-to-end time for the complete generation (prefill + decode).
- Prefill throughput: Rate of prompt token processing (tokens/second).
- Decode throughput: Rate of token generation (tokens/second).
- Total throughput: Overall token generation rate (tokens/second).
- Peak GPU memory: Maximum GPU HBM allocated during the generation.
- Model size: Total parameter memory in bytes (computed from architecture).
- KV cache size: Memory consumed by key-value attention caches.
- Hidden state size: Memory for intermediate hidden representations.
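The metric set above can be gathered into a simple record type. This is an illustrative sketch only; the field names are hypothetical, not the framework's actual API:

```python
from dataclasses import dataclass

# Hypothetical container for the benchmark metrics listed above.
@dataclass
class InferenceMetrics:
    prefill_latency_s: float   # time to process the prompt
    decode_latency_s: float    # time to generate output tokens
    total_latency_s: float     # prefill + decode
    prefill_throughput: float  # prompt tokens / second
    decode_throughput: float   # generated tokens / second
    total_throughput: float    # overall tokens / second
    peak_gpu_mem_gb: float     # max GPU memory allocated
    model_size_gb: float       # parameter memory
    kv_cache_gb: float         # key-value cache memory
    hidden_state_gb: float     # intermediate activation memory
```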
Why GPU-Synchronized Timers Are Critical
CUDA operations are asynchronous by default: the CPU dispatches work to the GPU and continues executing without waiting for completion. Without explicit GPU synchronization, timing measurements would only capture the time to launch CUDA kernels, not the time for them to complete. The ZeRO-Inference benchmarking framework addresses this with two synchronization mechanisms:
- Timer-level synchronization: The timers("generate-forward") timer calls get_accelerator().synchronize() before recording start and stop timestamps.
- Hook-level synchronization: The model forward hooks call torch.cuda.synchronize() before recording prefill start/end times.
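The synchronize-before-both-timestamps pattern can be sketched as a small context manager. Here gpu_synchronize is a stand-in stub (on real hardware it would be get_accelerator().synchronize() or torch.cuda.synchronize()), so the sketch runs anywhere:

```python
import time

def gpu_synchronize():
    """Stand-in for get_accelerator().synchronize() / torch.cuda.synchronize().
    On a real GPU this blocks until all queued kernels finish; here it is a
    no-op so the sketch is runnable without CUDA."""
    pass

class SyncTimer:
    """Times a region of GPU work, synchronizing before BOTH timestamps so the
    interval covers kernel completion, not just kernel launch."""
    def __enter__(self):
        gpu_synchronize()                  # drain pending work before starting
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        gpu_synchronize()                  # wait for the timed kernels to finish
        self.elapsed = time.perf_counter() - self.start
```

Without the synchronization calls, the measured interval would only cover the CPU-side kernel launches, systematically under-reporting latency.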
Warm-up and Iteration
The benchmarking loop runs for a configurable number of iterations (--loops, default 3). The last iteration is used for reporting rather than averaging, under the assumption that:
- The first iterations warm up CUDA contexts, JIT compilation, and memory allocators.
- The last iteration reflects steady-state performance.
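This loop-and-keep-last policy can be sketched as follows; run_generation is a hypothetical callable standing in for the model's generate call:

```python
import time

def benchmark(run_generation, loops=3):
    """Mimics the --loops behavior described above: run `loops` iterations
    and report only the final one, letting the earlier runs warm up CUDA
    contexts, JIT compilation, and memory allocators."""
    elapsed = None
    for _ in range(loops):
        start = time.perf_counter()
        run_generation()
        elapsed = time.perf_counter() - start  # overwritten each pass; last wins
    return elapsed
```

Reporting the last iteration rather than an average deliberately excludes one-time startup costs from the steady-state number.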
Theoretical Basis
Throughput Metrics
| Metric | Formula | Description |
|---|---|---|
| Prefill throughput | batch_size * prompt_len / prefill_latency | Tokens processed per second during prompt encoding |
| Decode throughput | batch_size * (gen_len - 1) / decode_latency | Tokens generated per second during autoregressive decoding |
| Total throughput | batch_size * gen_len / total_latency | End-to-end tokens generated per second |
The decode throughput uses gen_len - 1 because the first token in the generation is produced as part of the prefill phase (the model outputs the first new token at the end of processing the prompt).
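The three formulas translate directly into code (a sketch; the function name is illustrative):

```python
def throughputs(batch_size, prompt_len, gen_len,
                prefill_latency, decode_latency):
    """Applies the three throughput formulas above. Note gen_len - 1 in the
    decode rate: the first generated token comes out of the prefill pass."""
    total_latency = prefill_latency + decode_latency
    return {
        "prefill": batch_size * prompt_len / prefill_latency,
        "decode": batch_size * (gen_len - 1) / decode_latency,
        "total": batch_size * gen_len / total_latency,
    }
```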
Latency Decomposition
total_latency = prefill_latency + decode_latency
Prefill and decode have fundamentally different computational profiles:
| Phase | Bottleneck | Scaling | With Offloading |
|---|---|---|---|
| Prefill | Compute (FLOPS) | Proportional to prompt_len * hidden_size^2 | Parameter fetch overlaps with computation |
| Decode | Memory bandwidth | Proportional to gen_len * model_size / bandwidth | Each token requires parameter fetch from CPU/NVMe |
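The decode row lends itself to a back-of-envelope lower bound. In the sketch below, the 350 GB model size and 25 GB/s link bandwidth are illustrative assumptions (roughly an FP16 175B-class model over a PCIe 4.0 x16 link), not measurements:

```python
def decode_latency_estimate(gen_len, model_bytes, bandwidth_bytes_per_s):
    """Lower bound on decode latency when every generated token requires
    streaming the full parameter set over the offload link
    (the memory-bandwidth-bound regime)."""
    return gen_len * model_bytes / bandwidth_bytes_per_s

# Illustrative assumption: ~350 GB of FP16 weights over a ~25 GB/s link
# works out to roughly 14 s per generated token.
per_token = decode_latency_estimate(1, 350e9, 25e9)
```

This is why larger batch sizes help offloaded decode: the same parameter fetch is amortized over more tokens per step.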
Model Size Estimation
The model size in bytes is estimated from architectural parameters:
model_bytes = 2 * (num_layers * (
# self-attention: Q, K, V projections + output projection
hidden_size * (3 * hidden_size + 1) + hidden_size * (hidden_size + 1) +
# MLP: up-projection + down-projection
hidden_size * (4 * hidden_size + 1) + hidden_size * 4 * (hidden_size + 1) +
# layer norms (2 per layer, each with weight + bias)
hidden_size * 4
) +
# token embedding + LM head
vocab_size * (hidden_size + 1))
The factor of 2 accounts for FP16 representation (2 bytes per parameter).
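The estimate above can be written as a function (a sketch; the function and argument names are illustrative). Each "+ 1" term folds a bias vector into its weight matrix:

```python
def model_size_bytes(num_layers, hidden_size, vocab_size, bytes_per_param=2):
    """Parameter memory from the architectural formula above (FP16 default)."""
    per_layer = (
        hidden_size * (3 * hidden_size + 1)    # fused Q, K, V projections
        + hidden_size * (hidden_size + 1)      # attention output projection
        + hidden_size * (4 * hidden_size + 1)  # MLP up-projection
        + hidden_size * 4 * (hidden_size + 1)  # MLP down-projection
        + hidden_size * 4                      # two layer norms, weight + bias
    )
    embedding = vocab_size * (hidden_size + 1)  # token embedding + LM head
    return bytes_per_param * (num_layers * per_layer + embedding)
```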
KV Cache Size
cache_bytes = 2 * batch_size * seq_len * num_layers * hidden_size * 2
where:
- First factor of 2: keys and values
- seq_len = prompt_len + gen_len
- Last factor of 2: FP16 (2 bytes per element)
Hidden State Size
hidden_bytes = batch_size * seq_len * hidden_size * 2
This represents the memory for a single layer's hidden state activations in FP16.
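Both memory formulas are straightforward to encode (a sketch with illustrative function names):

```python
def kv_cache_bytes(batch_size, prompt_len, gen_len, num_layers, hidden_size):
    """KV cache memory: 2 (keys and values) * tokens * layers * hidden
    * 2 bytes (FP16)."""
    seq_len = prompt_len + gen_len
    return 2 * batch_size * seq_len * num_layers * hidden_size * 2

def hidden_state_bytes(batch_size, seq_len, hidden_size):
    """A single layer's hidden state activations in FP16."""
    return batch_size * seq_len * hidden_size * 2
```

Note the KV cache scales with num_layers while the hidden state term covers only one layer at a time, which is why the cache dominates activation memory at long sequence lengths.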
Benchmark Log Format
The benchmark log records all metrics in a structured, human-readable text format:
model size: 326.914 GB cache size: 1.875 GB hidden size (p): 0.010 GB
peak gpu mem: 12.341 GB prefill latency: 8.234 s prefill throughput: 497.52 token/s
decode latency: 42.106 s decode throughput: 5.89 token/s
total latency: 50.340 s total throughput: 5.08 token/s
Each line of the log captures a complementary dimension of system behavior:
- Line 1: Memory characteristics (model size, cache overhead, hidden state size)
- Line 2: Peak GPU utilization and prefill performance
- Line 3: Decode performance (typically the bottleneck for offloaded models)
- Line 4: End-to-end performance summary
Automatic Log File Naming
When output_file="auto", the log filename encodes the full experiment configuration:
ds-{model}-bs{batch}-prompt{prompt_len}-gen{gen_len}-n{nodes}x{gpus}-{offload}[-kv_offload][-w_quant].log
Examples:
- ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
- ds-bloom-bs32-prompt512-gen32-n1x1-disk-kv_offload.log
- ds-Llama-2-70b-hf-bs64-prompt512-gen32-n1x1-cpu.log
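The naming scheme can be sketched as a small formatting function; the argument names are illustrative, not the script's actual flags:

```python
def log_filename(model, batch, prompt_len, gen_len, nodes, gpus,
                 offload, kv_offload=False, w_quant=False):
    """Builds a log filename following the scheme above: required fields
    first, then the optional -kv_offload and -w_quant suffixes."""
    name = (f"ds-{model}-bs{batch}-prompt{prompt_len}"
            f"-gen{gen_len}-n{nodes}x{gpus}-{offload}")
    if kv_offload:
        name += "-kv_offload"
    if w_quant:
        name += "-w_quant"
    return name + ".log"
```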
Performance Comparison Reference
The following table shows representative throughput measurements (tokens/second) from the ZeRO-Inference benchmarks on a single NVIDIA A6000 (48 GB GPU memory, 252 GB CPU RAM):
| Model | Weight Quant | KV Offload | Batch Size | Throughput (tok/s) |
|---|---|---|---|---|
| OPT-30B | 4-bit | No | 24 | 22.74 |
| OPT-30B | No | Yes | 96 | 12.32 |
| OPT-30B | 4-bit | Yes | 128 | 19.34 |
| OPT-66B | 4-bit | Yes | 64 | 8.08 |
| OPT-175B | 4-bit | Yes | 24 | 2.26 |
| BLOOM-176B | 4-bit | Yes | 24 | 1.33 |
| LLaMA-2-70B | 4-bit | No | 96 | 24.05 |