
Principle:Microsoft DeepSpeedExamples Inference Performance Measurement

From Leeroopedia


Sources

Domains

  • Performance
  • Benchmarking
  • Inference

Overview

A benchmarking methodology for measuring inference performance, including prefill latency, decode throughput, and memory utilization.

Description

Performance measurement for LLM inference tracks multiple complementary metrics that together characterize system behavior. The ZeRO-Inference benchmarking framework measures:

  1. Prefill latency: Time to process the input prompt (all tokens processed in parallel).
  2. Decode latency: Time to generate all output tokens (autoregressive, one token at a time).
  3. Total latency: End-to-end time for the complete generation (prefill + decode).
  4. Prefill throughput: Rate of prompt token processing (tokens/second).
  5. Decode throughput: Rate of token generation (tokens/second).
  6. Total throughput: Overall token generation rate (tokens/second).
  7. Peak GPU memory: Maximum GPU memory allocated during generation.
  8. Model size: Total parameter memory in bytes (computed from architecture).
  9. KV cache size: Memory consumed by key-value attention caches.
  10. Hidden state size: Memory for intermediate hidden representations.

Why GPU-Synchronized Timers Are Critical

CUDA operations are asynchronous by default: the CPU dispatches work to the GPU and continues executing without waiting for completion. Without explicit GPU synchronization, timing measurements would only capture the time to launch CUDA kernels, not the time for them to complete. The ZeRO-Inference benchmarking framework addresses this with two synchronization mechanisms:

  • Timer-level synchronization: The timers("generate-forward") timer calls get_accelerator().synchronize() before recording start and stop timestamps.
  • Hook-level synchronization: The model forward hooks call torch.cuda.synchronize() before recording prefill start/end times.
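
The pattern behind both mechanisms can be sketched as follows. This is a minimal illustration, not the framework's actual timer class: the `SyncedTimer` name and the `sync_fn` hook are assumptions made here so the sketch runs without a GPU; on a CUDA system you would pass `torch.cuda.synchronize` (or `get_accelerator().synchronize`) as the hook.

```python
import time

class SyncedTimer:
    """Wall-clock timer that invokes a device-synchronization hook before
    each timestamp, so the elapsed time covers kernel *completion* rather
    than just kernel *launch*."""

    def __init__(self, sync_fn=None):
        # On CUDA: SyncedTimer(sync_fn=torch.cuda.synchronize)
        self._sync = sync_fn if sync_fn is not None else (lambda: None)
        self._start = None
        self.elapsed = 0.0

    def start(self):
        self._sync()                       # drain queued GPU work first
        self._start = time.perf_counter()

    def stop(self):
        self._sync()                       # ensure timed kernels finished
        self.elapsed = time.perf_counter() - self._start
        return self.elapsed
```

Without the synchronization calls, `elapsed` would measure only the CPU-side dispatch time of asynchronous kernels.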

Warm-up and Iteration

The benchmarking loop runs for a configurable number of iterations (--loops, default 3). The last iteration is used for reporting rather than averaging, under the assumption that:

  • The first iterations warm up CUDA contexts, JIT compilation, and memory allocators.
  • The last iteration reflects steady-state performance.
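
The loop structure amounts to the following sketch (the `benchmark` and `run_once` names are illustrative, not the framework's actual API):

```python
def benchmark(run_once, loops=3):
    """Run the timed generation `loops` times and report only the last
    iteration: early passes warm up CUDA contexts, JIT-compiled kernels,
    and memory allocators, so the final pass reflects steady state."""
    metrics = None
    for _ in range(loops):
        metrics = run_once()   # one full timed generate() pass
    return metrics             # steady-state numbers from the last pass
```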

Theoretical Basis

Throughput Metrics

Metric             | Formula                                      | Description
Prefill throughput | batch_size * prompt_len / prefill_latency    | Tokens processed per second during prompt encoding
Decode throughput  | batch_size * (gen_len - 1) / decode_latency  | Tokens generated per second during autoregressive decoding
Total throughput   | batch_size * gen_len / total_latency         | End-to-end tokens generated per second

The decode throughput uses gen_len - 1 because the first token in the generation is produced as part of the prefill phase (the model outputs the first new token at the end of processing the prompt).
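
As a runnable version of these formulas (the function name and dict keys are illustrative), the sample values below assume a batch size of 8, prompt length 512, and generation length 32, which approximately reproduces the sample log shown later on this page:

```python
def throughputs(batch_size, prompt_len, gen_len,
                prefill_latency, decode_latency):
    """Compute the three throughput metrics (tokens/second)."""
    total_latency = prefill_latency + decode_latency
    return {
        "prefill": batch_size * prompt_len / prefill_latency,
        # The first generated token is produced by the prefill pass,
        # so decode accounts for only gen_len - 1 tokens.
        "decode": batch_size * (gen_len - 1) / decode_latency,
        "total": batch_size * gen_len / total_latency,
    }
```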

Latency Decomposition

total_latency = prefill_latency + decode_latency

Prefill and decode have fundamentally different computational profiles:

Phase   | Bottleneck       | Scaling                                          | With Offloading
Prefill | Compute (FLOPS)  | Proportional to prompt_len * hidden_size^2       | Parameter fetch overlaps with computation
Decode  | Memory bandwidth | Proportional to gen_len * model_size / bandwidth | Each token requires parameter fetch from CPU/NVMe

Model Size Estimation

The model size in bytes is estimated from architectural parameters:

model_bytes = 2 * (num_layers * (
    # self-attention: Q, K, V projections + output projection
    hidden_size * (3 * hidden_size + 1) + hidden_size * (hidden_size + 1) +
    # MLP: up-projection + down-projection
    hidden_size * (4 * hidden_size + 1) + hidden_size * 4 * (hidden_size + 1) +
    # layer norms (2 per layer, each with weight + bias)
    hidden_size * 4
) +
# token embedding + LM head
vocab_size * (hidden_size + 1))

The factor of 2 accounts for FP16 representation (2 bytes per parameter).
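
The formula can be wrapped in a small helper (the function name is illustrative). The OPT-175B dimensions used in the check below (96 layers, hidden size 12288, vocabulary 50272) are the published architecture values:

```python
def estimate_model_bytes(num_layers, hidden_size, vocab_size,
                         bytes_per_param=2):
    """Estimate total parameter memory for an OPT-style decoder
    (FP16 by default), per the architectural formula above."""
    h = hidden_size
    per_layer = (
        h * (3 * h + 1) + h * (h + 1)        # QKV + output projection
        + h * (4 * h + 1) + h * 4 * (h + 1)  # MLP up- and down-projections
        + h * 4                              # 2 layer norms, weight + bias
    )
    return bytes_per_param * (num_layers * per_layer
                              + vocab_size * (h + 1))
```

For OPT-175B this yields roughly 349 GB (decimal), i.e. about 175 billion parameters at 2 bytes each.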

KV Cache Size

cache_bytes = 2 * batch_size * seq_len * num_layers * hidden_size * 2

where:

  • First factor of 2: keys and values
  • seq_len = prompt_len + gen_len
  • Last factor of 2: FP16 (2 bytes per element)

Hidden State Size

hidden_bytes = batch_size * seq_len * hidden_size * 2

This represents the memory for a single layer's hidden state activations in FP16.
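
Both memory estimates can be expressed as helpers (the function names are illustrative):

```python
def estimate_cache_bytes(batch_size, prompt_len, gen_len, num_layers,
                         hidden_size, bytes_per_elem=2):
    """KV cache memory: keys and values (leading factor 2) for every
    layer, position, and hidden dimension, FP16 by default."""
    seq_len = prompt_len + gen_len
    return 2 * batch_size * seq_len * num_layers * hidden_size * bytes_per_elem

def estimate_hidden_bytes(batch_size, prompt_len, gen_len, hidden_size,
                          bytes_per_elem=2):
    """Activation memory for a single layer's hidden states."""
    seq_len = prompt_len + gen_len
    return batch_size * seq_len * hidden_size * bytes_per_elem
```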

Benchmark Log Format

The benchmark log records all metrics in a structured, human-readable text format:

model size: 326.914 GB    cache size: 1.875 GB    hidden size (p): 0.010 GB
peak gpu mem: 12.341 GB   prefill latency: 8.234 s   prefill throughput: 497.52 token/s
decode latency: 42.106 s  decode throughput: 5.89 token/s
total latency: 50.340 s   total throughput: 5.08 token/s

Each line of the log captures a complementary dimension of system behavior:

  • Line 1: Memory characteristics (model size, cache overhead, hidden state size)
  • Line 2: Peak GPU utilization and prefill performance
  • Line 3: Decode performance (typically the bottleneck for offloaded models)
  • Line 4: End-to-end performance summary
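
A sketch of how such a log could be rendered (the `format_log` name and the metric-dict keys are assumptions made here, not the framework's actual field names):

```python
GB = 1 << 30  # sizes reported in binary gigabytes

def format_log(m):
    """Render a metrics dict in the four-line layout shown above."""
    return "\n".join([
        f"model size: {m['model_bytes'] / GB:.3f} GB    "
        f"cache size: {m['cache_bytes'] / GB:.3f} GB    "
        f"hidden size (p): {m['hidden_bytes'] / GB:.3f} GB",
        f"peak gpu mem: {m['peak_mem'] / GB:.3f} GB   "
        f"prefill latency: {m['prefill_latency']:.3f} s   "
        f"prefill throughput: {m['prefill_tp']:.2f} token/s",
        f"decode latency: {m['decode_latency']:.3f} s  "
        f"decode throughput: {m['decode_tp']:.2f} token/s",
        f"total latency: {m['total_latency']:.3f} s   "
        f"total throughput: {m['total_tp']:.2f} token/s",
    ])
```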

Automatic Log File Naming

When output_file="auto", the log filename encodes the full experiment configuration:

ds-{model}-bs{batch}-prompt{prompt_len}-gen{gen_len}-n{nodes}x{gpus}-{offload}[-kv_offload][-w_quant].log

Examples:

  • ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
  • ds-bloom-bs32-prompt512-gen32-n1x1-disk-kv_offload.log
  • ds-Llama-2-70b-hf-bs64-prompt512-gen32-n1x1-cpu.log
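
The naming scheme can be sketched as a small helper (the function and parameter names are illustrative, not the framework's actual API):

```python
def auto_log_name(model, batch, prompt_len, gen_len, nodes, gpus,
                  offload, kv_offload=False, w_quant=False):
    """Build the auto log filename from the experiment configuration,
    appending the optional suffixes only when the feature is enabled."""
    name = (f"ds-{model}-bs{batch}-prompt{prompt_len}-gen{gen_len}"
            f"-n{nodes}x{gpus}-{offload}")
    if kv_offload:
        name += "-kv_offload"
    if w_quant:
        name += "-w_quant"
    return name + ".log"
```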

Performance Comparison Reference

The following table shows representative throughput measurements (tokens/second) from the ZeRO-Inference benchmarks on a single NVIDIA A6000 (48 GB GPU memory, 252 GB CPU RAM):

Model       | Weight Quant | KV Offload | Batch Size | Throughput (tok/s)
OPT-30B     | 4-bit        | No         | 24         | 22.74
OPT-30B     | No           | Yes        | 96         | 12.32
OPT-30B     | 4-bit        | Yes        | 128        | 19.34
OPT-66B     | 4-bit        | Yes        | 64         | 8.08
OPT-175B    | 4-bit        | Yes        | 24         | 2.26
BLOOM-176B  | 4-bit        | Yes        | 24         | 1.33
LLaMA-2-70B | 4-bit        | No         | 96         | 24.05
