Overview
Concrete tool for recording inference benchmark metrics to structured log files.
Description
The write_benchmark_log function formats and persists performance measurements from ZeRO-Inference generation runs. It creates a human-readable, tab-separated log entry containing model size, cache size, hidden state size, peak GPU memory, and four throughput/latency pairs (prefill, decode, total). The log is appended to a file, enabling multiple benchmark configurations to be recorded in sequence.
Supporting utility functions compute the raw metric values from model configuration and runtime measurements:
- model_bytes(config): Estimates total model parameter memory from architectural dimensions (layers, hidden size, vocabulary).
- cache_bytes(config, batch_size, seq_len): Computes KV cache memory from layer count, hidden size, batch size, and sequence length.
- hidden_bytes(config, batch_size, seq_len): Computes single-layer hidden state memory.
- get_filename(...): Generates a descriptive log filename encoding the experiment configuration.
Code Reference
Source
| Repository | File | Lines |
|------------|------|-------|
| DeepSpeedExamples | inference/huggingface/zero_inference/utils.py | 86-103 (write_benchmark_log) |
| DeepSpeedExamples | inference/huggingface/zero_inference/utils.py | 27-43 (model_bytes, cache_bytes, hidden_bytes) |
| DeepSpeedExamples | inference/huggingface/zero_inference/utils.py | 105-123 (get_filename) |
Signature: write_benchmark_log
```python
def write_benchmark_log(
    filename,            # str: output file path
    model_size,          # int: total model size in bytes
    cache_size,          # int: KV cache size in bytes
    hidden_size,         # int: hidden state size in bytes
    gpu_peak_mem,        # int: peak GPU memory in bytes
    prefill_latency,     # float: prefill phase time in seconds
    prefill_throughput,  # float: prefill tokens per second
    decode_latency,      # float: decode phase time in seconds
    decode_throughput,   # float: decode tokens per second
    total_latency,       # float: total generation time in seconds
    total_throughput,    # float: total tokens per second
) -> str:
    """Format and write benchmark metrics to a log file.

    Returns:
        log_str (str): The formatted log string that was written.
    """
```
Signature: model_bytes
```python
def model_bytes(config) -> int:
    """Estimate total model parameter memory in bytes (FP16).

    Accounts for:
    - Self-attention: Q/K/V projections + output projection + biases
    - MLP: up-projection + down-projection + biases
    - Layer norms: weight + bias for 2 norms per layer
    - Embedding: vocabulary embedding + bias
    """
```
Signature: cache_bytes
```python
def cache_bytes(config, batch_size, seq_len) -> int:
    """Compute KV cache memory in bytes.

    Formula: 2 * batch_size * seq_len * num_hidden_layers * hidden_size * 2
    (2 for K+V, 2 for FP16 bytes)
    """
```
Signature: hidden_bytes
```python
def hidden_bytes(config, batch_size, seq_len) -> int:
    """Compute hidden state memory for one layer in bytes.

    Formula: batch_size * seq_len * hidden_size * 2
    (2 for FP16 bytes)
    """
```
Signature: get_filename
```python
def get_filename(
    model_name,         # str: HuggingFace model identifier
    batch_size,         # int: batch size
    prompt_len,         # int: prompt length
    gen_len,            # int: generation length
    cpu_offload,        # bool: CPU offloading enabled
    disk_offload,       # bool: NVMe offloading enabled
    num_nodes,          # int: number of compute nodes
    num_gpus_per_node,  # int: GPUs per node
    kv_offload,         # bool: KV cache offloading enabled
    weight_quantize,    # bool: weight quantization enabled
) -> str:
    """Generate a descriptive filename for the benchmark log."""
```
Import
```python
from utils import (GB, model_bytes, cache_bytes, hidden_bytes,
                   get_filename, write_benchmark_log)
```
I/O Contract
write_benchmark_log Inputs
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| filename | str | Yes | Path to the output log file (created or appended) |
| model_size | int | Yes | Total model size in bytes (from model_bytes()) |
| cache_size | int | Yes | KV cache size in bytes (from cache_bytes()) |
| hidden_size | int | Yes | Hidden state size in bytes (from hidden_bytes()) |
| gpu_peak_mem | int | Yes | Peak GPU memory in bytes (from get_accelerator().max_memory_allocated()) |
| prefill_latency | float | Yes | Prefill phase duration in seconds |
| prefill_throughput | float | Yes | Prefill token processing rate (tokens/second) |
| decode_latency | float | Yes | Decode phase duration in seconds |
| decode_throughput | float | Yes | Token generation rate during decoding (tokens/second) |
| total_latency | float | Yes | Total generation duration in seconds |
| total_throughput | float | Yes | End-to-end token generation rate (tokens/second) |
write_benchmark_log Outputs
| Name | Type | Description |
|------|------|-------------|
| log_str | str | The formatted log string that was written to the file |
| Log file (side effect) | File | The formatted string is appended to filename with a trailing newline |
model_bytes Inputs
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| config | AutoConfig | Yes | Model configuration with hidden_size, num_hidden_layers, and vocab_size |
cache_bytes Inputs
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| config | AutoConfig | Yes | Model configuration with num_hidden_layers and hidden_size |
| batch_size | int | Yes | Batch size |
| seq_len | int | Yes | Total sequence length (prompt_len + gen_len) |
Implementation Details
Log Format
The log output is a multi-line string with tab-separated fields, all sizes converted from bytes to GB:
```python
def write_benchmark_log(filename, model_size, cache_size, hidden_size,
                        gpu_peak_mem, prefill_latency, prefill_throughput,
                        decode_latency, decode_throughput,
                        total_latency, total_throughput):
    log_str = (f"model size: {model_size/GB:.3f} GB\t"
               f"cache size: {cache_size/GB:.3f} GB\t"
               f"hidden size (p): {hidden_size/GB:.3f} GB\n"
               f"peak gpu mem: {gpu_peak_mem/GB:.3f} GB\t"
               f"prefill latency: {prefill_latency:.3f} s\t"
               f"prefill throughput: {prefill_throughput:.3f} token/s\n"
               f"decode latency: {decode_latency:.3f} s\t"
               f"decode throughput: {decode_throughput:.3f} token/s\n"
               f"total latency: {total_latency:.3f} s\t"
               f"total throughput: {total_throughput:.3f} token/s")
    with open(filename, "a") as fout:
        fout.write(log_str + "\n")
    return log_str
```
Model Size Calculation
```python
def model_bytes(config):
    h = config.hidden_size
    return 2 * (config.num_hidden_layers * (
        # self-attention: Q, K, V (3*h*h + 3*h) + output (h*h + h)
        h * (3 * h + 1) + h * (h + 1) +
        # MLP: up-projection (h*4h + 4h) + down-projection (4h*h + h)
        h * (4 * h + 1) + h * 4 * (h + 1) +
        # layer norms: 2 norms * (weight h + bias h) = 4h
        h * 4) +
        # embedding: vocab_size * (h + 1)
        config.vocab_size * (h + 1))
```
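The per-layer terms above collapse algebraically to 12*h^2 + 11*h FP16 parameters, plus the vocab_size*(h+1) embedding term. A quick sanity check of that simplification, using a SimpleNamespace stand-in for the config (toy dimensions, not a real model):

```python
from types import SimpleNamespace

def model_bytes(config):
    h = config.hidden_size
    return 2 * (config.num_hidden_layers * (
        h * (3 * h + 1) + h * (h + 1) +      # attention: QKV + output projection
        h * (4 * h + 1) + h * 4 * (h + 1) +  # MLP: up + down projections
        h * 4) +                             # two layer norms (weight + bias)
        config.vocab_size * (h + 1))         # embedding

config = SimpleNamespace(hidden_size=4096, num_hidden_layers=32, vocab_size=50272)
h, L, V = config.hidden_size, config.num_hidden_layers, config.vocab_size

# Per layer: (3h^2+h) + (h^2+h) + (4h^2+h) + (4h^2+4h) + 4h = 12h^2 + 11h
assert model_bytes(config) == 2 * (L * (12 * h * h + 11 * h) + V * (h + 1))
```

The leading factor of 2 converts parameter counts to bytes under the FP16 assumption; the estimate ignores architecture-specific extras such as positional embeddings.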
Filename Generation
```python
def get_filename(model_name, batch_size, prompt_len, gen_len,
                 cpu_offload, disk_offload, num_nodes, num_gpus_per_node,
                 kv_offload, weight_quantize):
    simple_name = model_name.split('/')[-1]
    filename = "ds-"
    filename += f"{simple_name}-bs{batch_size}-prompt{prompt_len}-gen{gen_len}-"
    filename += f"n{num_nodes}x{num_gpus_per_node}-"
    if cpu_offload:
        filename += "cpu"
    elif disk_offload:
        filename += "disk"
    else:
        filename += "gpu"
    if kv_offload:
        filename += "-kv_offload"
    if weight_quantize:
        filename += "-w_quant"
    return filename
```
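For reference, a call with a typical offloading configuration produces a name like the one below (the definition is repeated so the snippet runs standalone; the model and settings are illustrative):

```python
def get_filename(model_name, batch_size, prompt_len, gen_len,
                 cpu_offload, disk_offload, num_nodes, num_gpus_per_node,
                 kv_offload, weight_quantize):
    # Same logic as shown above, repeated for a self-contained example.
    simple_name = model_name.split('/')[-1]
    filename = "ds-"
    filename += f"{simple_name}-bs{batch_size}-prompt{prompt_len}-gen{gen_len}-"
    filename += f"n{num_nodes}x{num_gpus_per_node}-"
    if cpu_offload:
        filename += "cpu"
    elif disk_offload:
        filename += "disk"
    else:
        filename += "gpu"
    if kv_offload:
        filename += "-kv_offload"
    if weight_quantize:
        filename += "-w_quant"
    return filename

name = get_filename("facebook/opt-30b", batch_size=8, prompt_len=512, gen_len=32,
                    cpu_offload=True, disk_offload=False,
                    num_nodes=1, num_gpus_per_node=1,
                    kv_offload=True, weight_quantize=False)
print(name)  # ds-opt-30b-bs8-prompt512-gen32-n1x1-cpu-kv_offload
```

Because cpu_offload is checked before disk_offload, a run with both flags set is labeled "cpu"; only one placement tag ever appears in the name.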
Constants
| Constant | Value | Usage |
|----------|-------|-------|
| KB | 1 << 10 (1,024) | Kilobyte conversion |
| MB | 1 << 20 (1,048,576) | Megabyte conversion |
| GB | 1 << 30 (1,073,741,824) | Gigabyte conversion, used for log formatting |
| T | 1e12 | Trillion, available for FLOPS calculations |
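The constants are plain module-level bindings. A short sketch of how GB is applied when formatting a raw byte count (the peak-memory value is illustrative):

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30  # binary units, as defined in utils.py
T = 1e12                                 # decimal trillion, for FLOP-style counts

peak_bytes = 13_000_000_000  # e.g. a raw byte count from the accelerator
print(f"peak gpu mem: {peak_bytes / GB:.3f} GB")  # prints "peak gpu mem: 12.107 GB"
```

Note the binary divisor: 13 GB of decimal bytes reports as about 12.107 GiB, so log values are slightly smaller than vendor-quoted decimal sizes.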
Usage Example
```python
from utils import model_bytes, cache_bytes, hidden_bytes, write_benchmark_log
from transformers import AutoConfig

config = AutoConfig.from_pretrained("facebook/opt-66b")

# Compute memory metrics
m_size = model_bytes(config)  # total FP16 parameter bytes for OPT-66B
c_size = cache_bytes(config, batch_size=16, seq_len=544)
h_size = hidden_bytes(config, batch_size=16, seq_len=544)

# Write benchmark results
log_str = write_benchmark_log(
    filename="ds-opt-66b-bs16-prompt512-gen32-n1x1-cpu-w_quant.log",
    model_size=m_size,
    cache_size=c_size,
    hidden_size=h_size,
    gpu_peak_mem=12_000_000_000,  # 12 GB peak
    prefill_latency=3.456,
    prefill_throughput=2370.37,
    decode_latency=52.789,
    decode_throughput=4.56,
    total_latency=56.245,
    total_throughput=9.10,
)
print(log_str)
# Example output (size values illustrative):
# model size: 123.456 GB    cache size: 1.875 GB    hidden size (p): 0.010 GB
# peak gpu mem: 11.176 GB   prefill latency: 3.456 s    prefill throughput: 2370.370 token/s
# decode latency: 52.789 s  decode throughput: 4.560 token/s
# total latency: 56.245 s   total throughput: 9.100 token/s
```
Related Pages