Principle: Microsoft DeepSpeedExamples Inference Performance Measurement
Sources
- Blog: ZeRO-Inference: Democratizing massive model inference -- deepspeed.ai/2022/09/09/zero-inference
- Paper: ZeRO-Inference: Democratizing massive model inference -- arXiv:2207.00032
Domains
- Performance
- Benchmarking
- Inference
Overview
A benchmarking methodology for measuring inference performance including prefill latency, decode throughput, and memory utilization.
Description
Performance measurement for LLM inference tracks multiple complementary metrics that together characterize system behavior. The ZeRO-Inference benchmarking framework measures:
- Prefill latency: Time to process the input prompt (all tokens processed in parallel).
- Decode latency: Time to generate all output tokens (autoregressive, one token at a time).
- Total latency: End-to-end time for the complete generation (prefill + decode).
- Prefill throughput: Rate of prompt token processing (tokens/second).
- Decode throughput: Rate of token generation (tokens/second).
- Total throughput: Overall token generation rate (tokens/second).
- Peak GPU memory: Maximum GPU HBM allocated during the generation.
- Model size: Total parameter memory in bytes (computed from architecture).
- KV cache size: Memory consumed by key-value attention caches.
- Hidden state size: Memory for intermediate hidden representations.
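The metric set above can be gathered into a simple record type. This is an illustrative sketch only; the field names are hypothetical, not the framework's actual API:

```python
from dataclasses import dataclass

# Hypothetical container for the benchmark metrics listed above.
@dataclass
class InferenceMetrics:
    prefill_latency_s: float   # time to process the prompt
    decode_latency_s: float    # time to generate output tokens
    total_latency_s: float     # prefill + decode
    prefill_throughput: float  # prompt tokens / second
    decode_throughput: float   # generated tokens / second
    total_throughput: float    # overall tokens / second
    peak_gpu_mem_gb: float     # max GPU memory allocated
    model_size_gb: float       # parameter memory
    kv_cache_gb: float         # key-value cache memory
    hidden_state_gb: float     # intermediate activation memory
```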
Why GPU-Synchronized Timers Are Critical
CUDA operations are asynchronous by default: the CPU dispatches work to the GPU and continues executing without waiting for completion. Without explicit GPU synchronization, timing measurements would only capture the time to launch CUDA kernels, not the time for them to complete. The ZeRO-Inference benchmarking framework addresses this with two synchronization mechanisms:
- Timer-level synchronization: The timers("generate-forward") timer calls get_accelerator().synchronize() before recording start and stop timestamps.
- Hook-level synchronization: The model forward hooks call torch.cuda.synchronize() before recording prefill start/end times.
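The synchronize-before-both-timestamps pattern can be sketched as a small context manager. Here gpu_synchronize is a stand-in stub (on real hardware it would be get_accelerator().synchronize() or torch.cuda.synchronize()), so the sketch runs anywhere:

```python
import time

def gpu_synchronize():
    """Stand-in for get_accelerator().synchronize() / torch.cuda.synchronize().
    On a real GPU this blocks until all queued kernels finish; here it is a
    no-op so the sketch is runnable without CUDA."""
    pass

class SyncTimer:
    """Times a region of GPU work, synchronizing before BOTH timestamps so the
    interval covers kernel completion, not just kernel launch."""
    def __enter__(self):
        gpu_synchronize()                  # drain pending work before starting
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc):
        gpu_synchronize()                  # wait for the timed kernels to finish
        self.elapsed = time.perf_counter() - self.start
```

Without the synchronization calls, the measured interval would only cover the CPU-side kernel launches, systematically under-reporting latency.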
Warm-up and Iteration
The benchmarking loop runs for a configurable number of iterations (--loops, default 3). The last iteration is used for reporting rather than averaging, under the assumption that:
- The first iterations warm up CUDA contexts, JIT compilation, and memory allocators.
- The last iteration reflects steady-state performance.
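This loop-and-keep-last policy can be sketched as follows; run_generation is a hypothetical callable standing in for the model's generate call:

```python
import time

def benchmark(run_generation, loops=3):
    """Mimics the --loops behavior described above: run `loops` iterations
    and report only the final one, letting the earlier runs warm up CUDA
    contexts, JIT compilation, and memory allocators."""
    elapsed = None
    for _ in range(loops):
        start = time.perf_counter()
        run_generation()
        elapsed = time.perf_counter() - start  # overwritten each pass; last wins
    return elapsed
```

Reporting the last iteration rather than an average deliberately excludes one-time startup costs from the steady-state number.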
Theoretical Basis
Throughput Metrics
| Metric | Formula | Description |
|---|---|---|
| Prefill throughput | batch_size * prompt_len / prefill_latency | Tokens processed per second during prompt encoding |
| Decode throughput | batch_size * (gen_len - 1) / decode_latency | Tokens generated per second during autoregressive decoding |
| Total throughput | batch_size * gen_len / total_latency | End-to-end tokens generated per second |
The decode throughput uses gen_len - 1 because the first token in the generation is produced as part of the prefill phase (the model outputs the first new token at the end of processing the prompt).
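The three formulas translate directly into code (a sketch; the function name is illustrative):

```python
def throughputs(batch_size, prompt_len, gen_len,
                prefill_latency, decode_latency):
    """Applies the three throughput formulas above. Note gen_len - 1 in the
    decode rate: the first generated token comes out of the prefill pass."""
    total_latency = prefill_latency + decode_latency
    return {
        "prefill": batch_size * prompt_len / prefill_latency,
        "decode": batch_size * (gen_len - 1) / decode_latency,
        "total": batch_size * gen_len / total_latency,
    }
```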
Latency Decomposition
total_latency = prefill_latency + decode_latency
Prefill and decode have fundamentally different computational profiles:
| Phase | Bottleneck | Scaling | With Offloading |
|---|---|---|---|
| Prefill | Compute (FLOPS) | Proportional to prompt_len * hidden_size^2 | Parameter fetch overlaps with computation |
| Decode | Memory bandwidth | Proportional to gen_len * model_size / bandwidth | Each token requires parameter fetch from CPU/NVMe |
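The decode row lends itself to a back-of-envelope lower bound. In the sketch below, the 350 GB model size and 25 GB/s link bandwidth are illustrative assumptions (roughly an FP16 175B-class model over a PCIe 4.0 x16 link), not measurements:

```python
def decode_latency_estimate(gen_len, model_bytes, bandwidth_bytes_per_s):
    """Lower bound on decode latency when every generated token requires
    streaming the full parameter set over the offload link
    (the memory-bandwidth-bound regime)."""
    return gen_len * model_bytes / bandwidth_bytes_per_s

# Illustrative assumption: ~350 GB of FP16 weights over a ~25 GB/s link
# works out to roughly 14 s per generated token.
per_token = decode_latency_estimate(1, 350e9, 25e9)
```

This is why larger batch sizes help offloaded decode: the same parameter fetch is amortized over more tokens per step.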
Model Size Estimation
The model size in bytes is estimated from architectural parameters:
model_bytes = 2 * (num_layers * (
# self-attention: Q, K, V projections + output projection
hidden_size * (3 * hidden_size + 1) + hidden_size * (hidden_size + 1) +
# MLP: up-projection + down-projection
hidden_size * (4 * hidden_size + 1) + hidden_size * 4 * (hidden_size + 1) +
# layer norms (2 per layer, each with weight + bias)
hidden_size * 4
) +
# token embedding + LM head
vocab_size * (hidden_size + 1))
The factor of 2 accounts for FP16 representation (2 bytes per parameter).
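The estimate above can be written as a function (a sketch; the function and argument names are illustrative). Each "+ 1" term folds a bias vector into its weight matrix:

```python
def model_size_bytes(num_layers, hidden_size, vocab_size, bytes_per_param=2):
    """Parameter memory from the architectural formula above (FP16 default)."""
    per_layer = (
        hidden_size * (3 * hidden_size + 1)    # fused Q, K, V projections
        + hidden_size * (hidden_size + 1)      # attention output projection
        + hidden_size * (4 * hidden_size + 1)  # MLP up-projection
        + hidden_size * 4 * (hidden_size + 1)  # MLP down-projection
        + hidden_size * 4                      # two layer norms, weight + bias
    )
    embedding = vocab_size * (hidden_size + 1)  # token embedding + LM head
    return bytes_per_param * (num_layers * per_layer + embedding)
```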
KV Cache Size
cache_bytes = 2 * batch_size * seq_len * num_layers * hidden_size * 2
where:
- First factor of 2: keys and values
- seq_len = prompt_len + gen_len
- Last factor of 2: FP16 (2 bytes per element)
Hidden State Size
hidden_bytes = batch_size * seq_len * hidden_size * 2
This represents the memory for a single layer's hidden state activations in FP16.
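Both memory formulas are straightforward to encode (a sketch with illustrative function names):

```python
def kv_cache_bytes(batch_size, prompt_len, gen_len, num_layers, hidden_size):
    """KV cache memory: 2 (keys and values) * tokens * layers * hidden
    * 2 bytes (FP16)."""
    seq_len = prompt_len + gen_len
    return 2 * batch_size * seq_len * num_layers * hidden_size * 2

def hidden_state_bytes(batch_size, seq_len, hidden_size):
    """A single layer's hidden state activations in FP16."""
    return batch_size * seq_len * hidden_size * 2
```

Note the KV cache scales with num_layers while the hidden state term covers only one layer at a time, which is why the cache dominates activation memory at long sequence lengths.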
Benchmark Log Format
The benchmark log records all metrics in a structured, human-readable text format:
model size: 326.914 GB cache size: 1.875 GB hidden size (p): 0.010 GB
peak gpu mem: 12.341 GB prefill latency: 8.234 s prefill throughput: 497.52 token/s
decode latency: 42.106 s decode throughput: 5.89 token/s
total latency: 50.340 s total throughput: 5.08 token/s
Each line of the log captures a complementary dimension of system behavior:
- Line 1: Memory characteristics (model size, cache overhead, hidden state size)
- Line 2: Peak GPU utilization and prefill performance
- Line 3: Decode performance (typically the bottleneck for offloaded models)
- Line 4: End-to-end performance summary
Automatic Log File Naming
When output_file="auto", the log filename encodes the full experiment configuration:
ds-{model}-bs{batch}-prompt{prompt_len}-gen{gen_len}-n{nodes}x{gpus}-{offload}[-kv_offload][-w_quant].log
Examples:
- ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
- ds-bloom-bs32-prompt512-gen32-n1x1-disk-kv_offload.log
- ds-Llama-2-70b-hf-bs64-prompt512-gen32-n1x1-cpu.log
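The naming scheme can be sketched as a small formatting function; the argument names are illustrative, not the script's actual flags:

```python
def log_filename(model, batch, prompt_len, gen_len, nodes, gpus,
                 offload, kv_offload=False, w_quant=False):
    """Builds a log filename following the scheme above: required fields
    first, then the optional -kv_offload and -w_quant suffixes."""
    name = (f"ds-{model}-bs{batch}-prompt{prompt_len}"
            f"-gen{gen_len}-n{nodes}x{gpus}-{offload}")
    if kv_offload:
        name += "-kv_offload"
    if w_quant:
        name += "-w_quant"
    return name + ".log"
```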
Performance Comparison Reference
The following table shows representative throughput measurements (tokens/second) from the ZeRO-Inference benchmarks on a single NVIDIA A6000 (48 GB GPU memory, 252 GB CPU RAM):
| Model | Weight Quant | KV Offload | Batch Size | Throughput (tok/s) |
|---|---|---|---|---|
| OPT-30B | 4-bit | No | 24 | 22.74 |
| OPT-30B | No | Yes | 96 | 12.32 |
| OPT-30B | 4-bit | Yes | 128 | 19.34 |
| OPT-66B | 4-bit | Yes | 64 | 8.08 |
| OPT-175B | 4-bit | Yes | 24 | 2.26 |
| BLOOM-176B | 4-bit | Yes | 24 | 1.33 |
| LLaMA-2-70B | 4-bit | No | 96 | 24.05 |