Implementation:Microsoft DeepSpeedExamples Write Benchmark Log

From Leeroopedia


Overview

Concrete tool for recording inference benchmark metrics to structured log files.

Description

The write_benchmark_log function formats and persists performance measurements from ZeRO-Inference generation runs. It creates a human-readable, tab-separated log entry containing model size, cache size, hidden state size, peak GPU memory, and three latency/throughput pairs (prefill, decode, and total). The entry is appended to the log file, so multiple benchmark configurations can be recorded in sequence.

Supporting utility functions compute the raw metric values from model configuration and runtime measurements:

  • model_bytes(config): Estimates total model parameter memory from architectural dimensions (layers, hidden size, vocabulary).
  • cache_bytes(config, batch_size, seq_len): Computes KV cache memory from layer count, hidden size, batch size, and sequence length.
  • hidden_bytes(config, batch_size, seq_len): Computes single-layer hidden state memory.
  • get_filename(...): Generates a descriptive log filename encoding the experiment configuration.

Code Reference

Source

Repository File Lines
DeepSpeedExamples inference/huggingface/zero_inference/utils.py 86-103 (write_benchmark_log)
DeepSpeedExamples inference/huggingface/zero_inference/utils.py 27-43 (model_bytes, cache_bytes, hidden_bytes)
DeepSpeedExamples inference/huggingface/zero_inference/utils.py 105-123 (get_filename)

Signature: write_benchmark_log

def write_benchmark_log(
    filename,             # str: output file path
    model_size,           # int: total model size in bytes
    cache_size,           # int: KV cache size in bytes
    hidden_size,          # int: hidden state size in bytes
    gpu_peak_mem,         # int: peak GPU memory in bytes
    prefill_latency,      # float: prefill phase time in seconds
    prefill_throughput,   # float: prefill tokens per second
    decode_latency,       # float: decode phase time in seconds
    decode_throughput,    # float: decode tokens per second
    total_latency,        # float: total generation time in seconds
    total_throughput,     # float: total tokens per second
) -> str:
    """Format and write benchmark metrics to a log file.

    Returns:
        log_str (str): The formatted log string that was written.
    """

Signature: model_bytes

def model_bytes(config) -> int:
    """Estimate total model parameter memory in bytes (FP16).

    Accounts for:
    - Self-attention: Q/K/V projections + output projection + biases
    - MLP: up-projection + down-projection + biases
    - Layer norms: weight + bias for 2 norms per layer
    - Embedding: vocabulary embedding + bias
    """

Signature: cache_bytes

def cache_bytes(config, batch_size, seq_len) -> int:
    """Compute KV cache memory in bytes.

    Formula: 2 * batch_size * seq_len * num_hidden_layers * hidden_size * 2
    (2 for K+V, 2 for FP16 bytes)
    """

Signature: hidden_bytes

def hidden_bytes(config, batch_size, seq_len) -> int:
    """Compute hidden state memory for one layer in bytes.

    Formula: batch_size * seq_len * hidden_size * 2
    (2 for FP16 bytes)
    """
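Both formulas are easy to sanity-check. A minimal sketch, using a stand-in object in place of a HuggingFace AutoConfig (the dimensions below are hypothetical):

```python
from types import SimpleNamespace

# Stand-in for a HuggingFace AutoConfig; dimensions are hypothetical.
config = SimpleNamespace(num_hidden_layers=24, hidden_size=2048)

def cache_bytes(config, batch_size, seq_len):
    # 2 tensors (K and V) per layer, 2 bytes per FP16 element
    return 2 * batch_size * seq_len * config.num_hidden_layers * config.hidden_size * 2

def hidden_bytes(config, batch_size, seq_len):
    # activations of a single layer, 2 bytes per FP16 element
    return batch_size * seq_len * config.hidden_size * 2

kv = cache_bytes(config, batch_size=4, seq_len=1024)
act = hidden_bytes(config, batch_size=4, seq_len=1024)
print(kv)   # 805306368 bytes (768 MiB)
print(act)  # 16777216 bytes (16 MiB)
```

Note that the KV cache exceeds the single-layer hidden state by a factor of 2 * num_hidden_layers, which is why KV cache offloading is tracked as its own knob in the benchmark configuration.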

Signature: get_filename

def get_filename(
    model_name,          # str: HuggingFace model identifier
    batch_size,          # int: batch size
    prompt_len,          # int: prompt length
    gen_len,             # int: generation length
    cpu_offload,         # bool: CPU offloading enabled
    disk_offload,        # bool: NVMe offloading enabled
    num_nodes,           # int: number of compute nodes
    num_gpus_per_node,   # int: GPUs per node
    kv_offload,          # bool: KV cache offloading enabled
    weight_quantize,     # bool: weight quantization enabled
) -> str:
    """Generate a descriptive filename for the benchmark log."""

Import

from utils import (GB, model_bytes, cache_bytes, hidden_bytes,
                   get_filename, write_benchmark_log)

I/O Contract

write_benchmark_log Inputs

Parameter Type Required Description
filename str Yes Path to the output log file (created or appended)
model_size int Yes Total model size in bytes (from model_bytes())
cache_size int Yes KV cache size in bytes (from cache_bytes())
hidden_size int Yes Hidden state size in bytes (from hidden_bytes())
gpu_peak_mem int Yes Peak GPU memory in bytes (from get_accelerator().max_memory_allocated())
prefill_latency float Yes Prefill phase duration in seconds
prefill_throughput float Yes Prefill token processing rate (tokens/second)
decode_latency float Yes Decode phase duration in seconds
decode_throughput float Yes Token generation rate during decoding (tokens/second)
total_latency float Yes Total generation duration in seconds
total_throughput float Yes End-to-end token generation rate (tokens/second)

write_benchmark_log Outputs

Name Type Description
log_str str The formatted log string that was written to the file
Log file (side effect) File The formatted string is appended to filename with a trailing newline

model_bytes Inputs

Parameter Type Required Description
config AutoConfig Yes Model configuration with hidden_size, num_hidden_layers, and vocab_size

cache_bytes Inputs

Parameter Type Required Description
config AutoConfig Yes Model configuration with num_hidden_layers and hidden_size
batch_size int Yes Batch size
seq_len int Yes Total sequence length (prompt_len + gen_len)

Implementation Details

Log Format

The log output is a multi-line string with tab-separated fields; all byte sizes are converted to GB using the GB constant (1 << 30):

def write_benchmark_log(filename, model_size, cache_size, hidden_size,
        gpu_peak_mem, prefill_latency, prefill_throughput,
        decode_latency, decode_throughput, total_latency, total_throughput):

    log_str = (f"model size: {model_size/GB:.3f} GB\t"
               f"cache size: {cache_size/GB:.3f} GB\t"
               f"hidden size (p): {hidden_size/GB:.3f} GB\n"
               f"peak gpu mem: {gpu_peak_mem / GB:.3f} GB\t"
               f"prefill latency: {prefill_latency:.3f} s\t"
               f"prefill throughput: {prefill_throughput:.3f} token/s\n"
               f"decode latency: {decode_latency:.3f} s\t"
               f"decode throughput: {decode_throughput:.3f} token/s\n"
               f"total latency: {total_latency:.3f} s\t"
               f"total throughput: {total_throughput:.3f} token/s")
    with open(filename, "a") as fout:
        fout.write(log_str + "\n")

    return log_str
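Because the file is opened in append mode, successive calls accumulate entries rather than overwriting earlier ones. A runnable sketch of this behavior (the function body is repeated from above so the snippet is standalone; the path and metric values are made up for illustration):

```python
import os
import tempfile

GB = 1 << 30  # matches the module-level constant

def write_benchmark_log(filename, model_size, cache_size, hidden_size,
                        gpu_peak_mem, prefill_latency, prefill_throughput,
                        decode_latency, decode_throughput,
                        total_latency, total_throughput):
    # Same logic as shown above, repeated here so the example runs standalone.
    log_str = (f"model size: {model_size/GB:.3f} GB\t"
               f"cache size: {cache_size/GB:.3f} GB\t"
               f"hidden size (p): {hidden_size/GB:.3f} GB\n"
               f"peak gpu mem: {gpu_peak_mem / GB:.3f} GB\t"
               f"prefill latency: {prefill_latency:.3f} s\t"
               f"prefill throughput: {prefill_throughput:.3f} token/s\n"
               f"decode latency: {decode_latency:.3f} s\t"
               f"decode throughput: {decode_throughput:.3f} token/s\n"
               f"total latency: {total_latency:.3f} s\t"
               f"total throughput: {total_throughput:.3f} token/s")
    with open(filename, "a") as fout:
        fout.write(log_str + "\n")
    return log_str

path = os.path.join(tempfile.mkdtemp(), "bench.log")
args = dict(model_size=2 * GB, cache_size=GB // 2, hidden_size=GB // 100,
            gpu_peak_mem=3 * GB, prefill_latency=1.0, prefill_throughput=512.0,
            decode_latency=10.0, decode_throughput=3.2,
            total_latency=11.0, total_throughput=49.5)
write_benchmark_log(path, **args)   # first run
write_benchmark_log(path, **args)   # second run appends, not overwrites
with open(path) as f:
    lines = f.read().splitlines()
print(len(lines))  # 8: each entry spans 4 lines
```

Each entry occupies four physical lines (three embedded newlines plus the trailing one), so downstream parsers should treat the file as a sequence of four-line records.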

Model Size Calculation

def model_bytes(config):
    h = config.hidden_size
    return 2 * (config.num_hidden_layers * (   # 2 bytes per FP16 parameter
        # self-attention: Q/K/V projections h*(3h + 1) + output projection h*(h + 1)
        h * (3 * h + 1) + h * (h + 1) +
        # MLP: up-projection h*(4h + 1) + down-projection 4h*(h + 1)
        h * (4 * h + 1) + h * 4 * (h + 1) +
        # layer norms: 2 norms * (weight h + bias h) = 4h
        h * 4
    ) +
    # embedding: vocab_size * (h + 1)
    config.vocab_size * (h + 1))
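As a worked check of the estimate, the sketch below applies the same arithmetic to hypothetical OPT-1.3B-like dimensions (hidden size 2048, 24 layers, vocabulary 50272):

```python
from types import SimpleNamespace

def model_bytes(config):
    # FP16 estimate (2 bytes per parameter); mirrors the function shown above
    h = config.hidden_size
    return 2 * (config.num_hidden_layers * (
        h * (3 * h + 1) + h * (h + 1) +      # attention Q/K/V + output projection
        h * (4 * h + 1) + h * 4 * (h + 1) +  # MLP up- and down-projection
        h * 4                                # two layer norms (weight + bias)
    ) + config.vocab_size * (h + 1))         # vocabulary embedding

# Hypothetical OPT-1.3B-like dimensions
cfg = SimpleNamespace(hidden_size=2048, num_hidden_layers=24, vocab_size=50272)
GB = 1 << 30
print(f"{model_bytes(cfg) / GB:.2f} GB")  # 2.44 GB
```

That is roughly 1.31 billion parameters at 2 bytes each, in line with the model's nominal size.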

Filename Generation

def get_filename(model_name, batch_size, prompt_len, gen_len,
                 cpu_offload, disk_offload, num_nodes, num_gpus_per_node,
                 kv_offload, weight_quantize):
    simple_name = model_name.split('/')[-1]
    filename = "ds-"
    filename += f"{simple_name}-bs{batch_size}-prompt{prompt_len}-gen{gen_len}-"
    filename += f"n{num_nodes}x{num_gpus_per_node}-"
    if cpu_offload:
        filename += "cpu"
    elif disk_offload:
        filename += "disk"
    else:
        filename += "gpu"
    if kv_offload:
        filename += "-kv_offload"
    if weight_quantize:
        filename += "-w_quant"
    return filename
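For example, the configuration used in the Usage Example below produces the following name (the function body is repeated so the snippet runs standalone; the .log extension is presumably appended by the caller):

```python
def get_filename(model_name, batch_size, prompt_len, gen_len,
                 cpu_offload, disk_offload, num_nodes, num_gpus_per_node,
                 kv_offload, weight_quantize):
    # Repeated from above for a self-contained example.
    simple_name = model_name.split('/')[-1]
    filename = "ds-"
    filename += f"{simple_name}-bs{batch_size}-prompt{prompt_len}-gen{gen_len}-"
    filename += f"n{num_nodes}x{num_gpus_per_node}-"
    if cpu_offload:
        filename += "cpu"
    elif disk_offload:
        filename += "disk"
    else:
        filename += "gpu"
    if kv_offload:
        filename += "-kv_offload"
    if weight_quantize:
        filename += "-w_quant"
    return filename

name = get_filename("facebook/opt-66b", batch_size=16, prompt_len=512,
                    gen_len=32, cpu_offload=True, disk_offload=False,
                    num_nodes=1, num_gpus_per_node=1,
                    kv_offload=False, weight_quantize=True)
print(name)  # ds-opt-66b-bs16-prompt512-gen32-n1x1-cpu-w_quant
```

Note that the if/elif chain means cpu_offload takes precedence over disk_offload when both flags are set, and the placement tag defaults to "gpu" when neither is.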

Constants

Constant Value Usage
KB 1 << 10 (1,024) Kilobyte conversion
MB 1 << 20 (1,048,576) Megabyte conversion
GB 1 << 30 (1,073,741,824) Gigabyte conversion, used for log formatting
T 1e12 Trillion, available for FLOPS calculations

Usage Example

from utils import model_bytes, cache_bytes, hidden_bytes, write_benchmark_log
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-66b")

# Compute memory metrics
m_size = model_bytes(config)        # ~122.4 GB for OPT-66B in FP16
c_size = cache_bytes(config, batch_size=16, seq_len=544)
h_size = hidden_bytes(config, batch_size=16, seq_len=544)

# Write benchmark results
log_str = write_benchmark_log(
    filename="ds-opt-66b-bs16-prompt512-gen32-n1x1-cpu-w_quant.log",
    model_size=m_size,
    cache_size=c_size,
    hidden_size=h_size,
    gpu_peak_mem=12_000_000_000,     # 12 GB peak
    prefill_latency=3.456,
    prefill_throughput=2370.37,
    decode_latency=52.789,
    decode_throughput=4.56,
    total_latency=56.245,
    total_throughput=9.10,
)

print(log_str)
# model size: 122.375 GB   cache size: 19.125 GB   hidden size (p): 0.149 GB
# peak gpu mem: 11.176 GB   prefill latency: 3.456 s   prefill throughput: 2370.370 token/s
# decode latency: 52.789 s  decode throughput: 4.560 token/s
# total latency: 56.245 s   total throughput: 9.100 token/s
