Overview
Concrete tool for recording inference benchmark metrics to structured log files.
Description
The write_benchmark_log function formats and persists performance measurements from ZeRO-Inference generation runs. It creates a human-readable, tab-separated log entry containing model size, cache size, hidden state size, peak GPU memory, and four throughput/latency pairs (prefill, decode, total). The log is appended to a file, enabling multiple benchmark configurations to be recorded in sequence.
Supporting utility functions compute the raw metric values from model configuration and runtime measurements:
- model_bytes(config): Estimates total model parameter memory from architectural dimensions (layers, hidden size, vocabulary).
- cache_bytes(config, batch_size, seq_len): Computes KV cache memory from layer count, hidden size, batch size, and sequence length.
- hidden_bytes(config, batch_size, seq_len): Computes single-layer hidden state memory.
- get_filename(...): Generates a descriptive log filename encoding the experiment configuration.
Code Reference
Source
| Repository | File | Lines |
|------------|------|-------|
| DeepSpeedExamples | inference/huggingface/zero_inference/utils.py | 86-103 (write_benchmark_log) |
| DeepSpeedExamples | inference/huggingface/zero_inference/utils.py | 27-43 (model_bytes, cache_bytes, hidden_bytes) |
| DeepSpeedExamples | inference/huggingface/zero_inference/utils.py | 105-123 (get_filename) |
Signature: write_benchmark_log
```python
def write_benchmark_log(
    filename,            # str: output file path
    model_size,          # int: total model size in bytes
    cache_size,          # int: KV cache size in bytes
    hidden_size,         # int: hidden state size in bytes
    gpu_peak_mem,        # int: peak GPU memory in bytes
    prefill_latency,     # float: prefill phase time in seconds
    prefill_throughput,  # float: prefill tokens per second
    decode_latency,      # float: decode phase time in seconds
    decode_throughput,   # float: decode tokens per second
    total_latency,       # float: total generation time in seconds
    total_throughput,    # float: total tokens per second
) -> str:
    """Format and write benchmark metrics to a log file.

    Returns:
        log_str (str): The formatted log string that was written.
    """
```
Signature: model_bytes
```python
def model_bytes(config) -> int:
    """Estimate total model parameter memory in bytes (FP16).

    Accounts for:
    - Self-attention: Q/K/V projections + output projection + biases
    - MLP: up-projection + down-projection + biases
    - Layer norms: weight + bias for 2 norms per layer
    - Embedding: vocabulary embedding + bias
    """
```
Signature: cache_bytes
```python
def cache_bytes(config, batch_size, seq_len) -> int:
    """Compute KV cache memory in bytes.

    Formula: 2 * batch_size * seq_len * num_hidden_layers * hidden_size * 2
    (2 for K+V, 2 for FP16 bytes)
    """
```
Signature: hidden_bytes
```python
def hidden_bytes(config, batch_size, seq_len) -> int:
    """Compute hidden state memory for one layer in bytes.

    Formula: batch_size * seq_len * hidden_size * 2
    (2 for FP16 bytes)
    """
```
Signature: get_filename
```python
def get_filename(
    model_name,         # str: HuggingFace model identifier
    batch_size,         # int: batch size
    prompt_len,         # int: prompt length
    gen_len,            # int: generation length
    cpu_offload,        # bool: CPU offloading enabled
    disk_offload,       # bool: NVMe offloading enabled
    num_nodes,          # int: number of compute nodes
    num_gpus_per_node,  # int: GPUs per node
    kv_offload,         # bool: KV cache offloading enabled
    weight_quantize,    # bool: weight quantization enabled
) -> str:
    """Generate a descriptive filename for the benchmark log."""
```
Import
```python
from utils import (GB, model_bytes, cache_bytes, hidden_bytes,
                   get_filename, write_benchmark_log)
```
I/O Contract
write_benchmark_log Inputs
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| filename | str | Yes | Path to the output log file (created or appended) |
| model_size | int | Yes | Total model size in bytes (from model_bytes()) |
| cache_size | int | Yes | KV cache size in bytes (from cache_bytes()) |
| hidden_size | int | Yes | Hidden state size in bytes (from hidden_bytes()) |
| gpu_peak_mem | int | Yes | Peak GPU memory in bytes (from get_accelerator().max_memory_allocated()) |
| prefill_latency | float | Yes | Prefill phase duration in seconds |
| prefill_throughput | float | Yes | Prefill token processing rate (tokens/second) |
| decode_latency | float | Yes | Decode phase duration in seconds |
| decode_throughput | float | Yes | Token generation rate during decoding (tokens/second) |
| total_latency | float | Yes | Total generation duration in seconds |
| total_throughput | float | Yes | End-to-end token generation rate (tokens/second) |
write_benchmark_log Outputs
| Name | Type | Description |
|------|------|-------------|
| log_str | str | The formatted log string that was written to the file |
| Log file (side effect) | File | The formatted string is appended to filename with a trailing newline |
model_bytes Inputs
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| config | AutoConfig | Yes | Model configuration with hidden_size, num_hidden_layers, and vocab_size |
cache_bytes Inputs
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| config | AutoConfig | Yes | Model configuration with num_hidden_layers and hidden_size |
| batch_size | int | Yes | Batch size |
| seq_len | int | Yes | Total sequence length (prompt_len + gen_len) |
Implementation Details
Log Format
The log output is a multi-line string with tab-separated fields, all sizes converted from bytes to GB:
```python
def write_benchmark_log(filename, model_size, cache_size, hidden_size,
                        gpu_peak_mem, prefill_latency, prefill_throughput,
                        decode_latency, decode_throughput,
                        total_latency, total_throughput):
    log_str = (f"model size: {model_size/GB:.3f} GB\t"
               f"cache size: {cache_size/GB:.3f} GB\t"
               f"hidden size (p): {hidden_size/GB:.3f} GB\n"
               f"peak gpu mem: {gpu_peak_mem/GB:.3f} GB\t"
               f"prefill latency: {prefill_latency:.3f} s\t"
               f"prefill throughput: {prefill_throughput:.3f} token/s\n"
               f"decode latency: {decode_latency:.3f} s\t"
               f"decode throughput: {decode_throughput:.3f} token/s\n"
               f"total latency: {total_latency:.3f} s\t"
               f"total throughput: {total_throughput:.3f} token/s")
    with open(filename, "a") as fout:
        fout.write(log_str + "\n")
    return log_str
```
Model Size Calculation
```python
def model_bytes(config):
    h = config.hidden_size
    return 2 * (config.num_hidden_layers * (
        # self-attention: Q, K, V (3*h*h + 3*h) + output (h*h + h)
        h * (3 * h + 1) + h * (h + 1) +
        # MLP: up-projection (h*4h + 4h) + down-projection (4h*h + h)
        h * (4 * h + 1) + h * 4 * (h + 1) +
        # layer norms: 2 norms * (weight h + bias h) = 4h
        h * 4) +
        # embedding: vocab_size * (h + 1)
        config.vocab_size * (h + 1))
```
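The per-layer terms above collapse algebraically to 12*h^2 + 11*h FP16 parameters, plus the vocab_size*(h+1) embedding term. A quick sanity check of that simplification, using a SimpleNamespace stand-in for the config (toy dimensions, not a real model):

```python
from types import SimpleNamespace

def model_bytes(config):
    h = config.hidden_size
    return 2 * (config.num_hidden_layers * (
        h * (3 * h + 1) + h * (h + 1) +      # attention: QKV + output projection
        h * (4 * h + 1) + h * 4 * (h + 1) +  # MLP: up + down projections
        h * 4) +                             # two layer norms (weight + bias)
        config.vocab_size * (h + 1))         # embedding

config = SimpleNamespace(hidden_size=4096, num_hidden_layers=32, vocab_size=50272)
h, L, V = config.hidden_size, config.num_hidden_layers, config.vocab_size

# Per layer: (3h^2+h) + (h^2+h) + (4h^2+h) + (4h^2+4h) + 4h = 12h^2 + 11h
assert model_bytes(config) == 2 * (L * (12 * h * h + 11 * h) + V * (h + 1))
```

The leading factor of 2 converts parameter counts to bytes under the FP16 assumption; the estimate ignores architecture-specific extras such as positional embeddings.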
Filename Generation
```python
def get_filename(model_name, batch_size, prompt_len, gen_len,
                 cpu_offload, disk_offload, num_nodes, num_gpus_per_node,
                 kv_offload, weight_quantize):
    simple_name = model_name.split('/')[-1]
    filename = "ds-"
    filename += f"{simple_name}-bs{batch_size}-prompt{prompt_len}-gen{gen_len}-"
    filename += f"n{num_nodes}x{num_gpus_per_node}-"
    if cpu_offload:
        filename += "cpu"
    elif disk_offload:
        filename += "disk"
    else:
        filename += "gpu"
    if kv_offload:
        filename += "-kv_offload"
    if weight_quantize:
        filename += "-w_quant"
    return filename
```
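For reference, a call with a typical offloading configuration produces a name like the one below (the definition is repeated so the snippet runs standalone; the model and settings are illustrative):

```python
def get_filename(model_name, batch_size, prompt_len, gen_len,
                 cpu_offload, disk_offload, num_nodes, num_gpus_per_node,
                 kv_offload, weight_quantize):
    # Same logic as shown above, repeated for a self-contained example.
    simple_name = model_name.split('/')[-1]
    filename = "ds-"
    filename += f"{simple_name}-bs{batch_size}-prompt{prompt_len}-gen{gen_len}-"
    filename += f"n{num_nodes}x{num_gpus_per_node}-"
    if cpu_offload:
        filename += "cpu"
    elif disk_offload:
        filename += "disk"
    else:
        filename += "gpu"
    if kv_offload:
        filename += "-kv_offload"
    if weight_quantize:
        filename += "-w_quant"
    return filename

name = get_filename("facebook/opt-30b", batch_size=8, prompt_len=512, gen_len=32,
                    cpu_offload=True, disk_offload=False,
                    num_nodes=1, num_gpus_per_node=1,
                    kv_offload=True, weight_quantize=False)
print(name)  # ds-opt-30b-bs8-prompt512-gen32-n1x1-cpu-kv_offload
```

Because cpu_offload is checked before disk_offload, a run with both flags set is labeled "cpu"; only one placement tag ever appears in the name.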
Constants
| Constant | Value | Usage |
|----------|-------|-------|
| KB | 1 << 10 (1,024) | Kilobyte conversion |
| MB | 1 << 20 (1,048,576) | Megabyte conversion |
| GB | 1 << 30 (1,073,741,824) | Gigabyte conversion, used for log formatting |
| T | 1e12 | Trillion, available for FLOPS calculations |
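The constants are plain module-level bindings. A short sketch of how GB is applied when formatting a raw byte count (the peak-memory value is illustrative):

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30  # binary units, as defined in utils.py
T = 1e12                                 # decimal trillion, for FLOP-style counts

peak_bytes = 13_000_000_000  # e.g. a raw byte count from the accelerator
print(f"peak gpu mem: {peak_bytes / GB:.3f} GB")  # prints "peak gpu mem: 12.107 GB"
```

Note the binary divisor: 13 GB of decimal bytes reports as about 12.107 GiB, so log values are slightly smaller than vendor-quoted decimal sizes.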
Usage Example
```python
from utils import model_bytes, cache_bytes, hidden_bytes, write_benchmark_log
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-66b")

# Compute memory metrics
m_size = model_bytes(config)  # total FP16 parameter bytes for OPT-66B
c_size = cache_bytes(config, batch_size=16, seq_len=544)
h_size = hidden_bytes(config, batch_size=16, seq_len=544)

# Write benchmark results
log_str = write_benchmark_log(
    filename="ds-opt-66b-bs16-prompt512-gen32-n1x1-cpu-w_quant.log",
    model_size=m_size,
    cache_size=c_size,
    hidden_size=h_size,
    gpu_peak_mem=12_000_000_000,  # 12 GB peak
    prefill_latency=3.456,
    prefill_throughput=2370.37,
    decode_latency=52.789,
    decode_throughput=4.56,
    total_latency=56.245,
    total_throughput=9.10,
)
print(log_str)
# Example output (size values illustrative):
# model size: 123.456 GB    cache size: 1.875 GB    hidden size (p): 0.010 GB
# peak gpu mem: 11.176 GB   prefill latency: 3.456 s    prefill throughput: 2370.370 token/s
# decode latency: 52.789 s  decode throughput: 4.560 token/s
# total latency: 56.245 s   total throughput: 9.100 token/s
```
Related Pages