Implementation: Microsoft DeepSpeedExamples Run Generation
Overview
Concrete tool for executing text generation with ZeRO-partitioned models including KV cache offloading and benchmarking.
Description
The run_generation function is the main entry point for the ZeRO-Inference pipeline. It orchestrates the complete inference workflow: loading configuration and tokenizer, optionally creating dummy weights, initializing the DeepSpeed model, encoding prompts, running timed generation loops, computing performance metrics, and writing benchmark logs.
The function performs the following steps:
- Load model configuration and tokenizer via `get_model_config()` and `get_tokenizer()`.
- Create dummy weights (if the `--dummy` flag is set): instantiates the model on the meta device using `init_empty_weights()`, converts it to CPU tensors via `meta_to_cpu()`, and saves it to disk. This enables benchmarking without downloading real model weights.
- Initialize the DeepSpeed model via `get_ds_model()` within a `torch.no_grad()` context.
- Encode prompts: creates a batch of identical prompts (`"Paris is the capital city of"`) and encodes them to fixed-length tensors with padding.
- Configure KV cache offloading (if enabled) via `model.set_kv_cache_offload()`.
- Register timing hooks via `add_model_hooks()` for prefill/decode phase separation.
- Run the generation loop: executes `model.generate()` for the configured number of benchmark iterations, using GPU-synchronized timers.
- Compute metrics: calculates prefill latency/throughput, decode latency/throughput, total latency/throughput, and peak GPU memory.
- Write the benchmark log via `write_benchmark_log()` and optionally print the generated text.
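The steps above can be sketched end to end. Everything below is a simplified stand-in: the helpers are local stubs (not the real `get_model_config()`/`get_ds_model()` from the repository), and the stub "model" emits token ids instead of running inference.

```python
import time

# Simplified end-to-end sketch of the workflow; all helpers are local
# stubs standing in for the real get_model_config()/get_ds_model().
def get_model_config(name):
    return {"name": name}                     # stub config loader

def get_ds_model(config):
    return lambda n: list(range(n))           # stub model: n "tokens" out

def run_generation_sketch(model_name, batch_size, gen_len, loops):
    config = get_model_config(model_name)     # load configuration
    model = get_ds_model(config)              # initialize (stub) model
    prompts = ["Paris is the capital city of"] * batch_size
    costs = []
    for _ in range(loops):                    # timed generation loop
        start = time.perf_counter()
        output_ids = [model(gen_len) for _ in prompts]
        costs.append(time.perf_counter() - start)
    total_latency = max(costs[-1], 1e-10)     # last iteration's time
    throughput = batch_size * gen_len / total_latency
    return output_ids, throughput
```

The real function additionally handles dummy-weight creation, KV-cache offload configuration, per-phase timing hooks, and log writing, as described in the sections below.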
Code Reference
Source
| Repository | File | Lines |
|---|---|---|
| DeepSpeedExamples | inference/huggingface/zero_inference/run_model.py | 173-402 |
Signature
def run_generation(
    model_name,          # str: HuggingFace model identifier
    batch_size,          # int: total batch size across all GPUs
    prompt_len,          # int: fixed prompt length in tokens
    gen_len,             # int: number of tokens to generate
    cpu_offload,         # bool: offload parameters to CPU
    disk_offload,        # bool: offload parameters to NVMe
    offload_dir,         # str: directory for offload storage
    num_nodes,           # int: number of compute nodes
    num_gpus_per_node,   # int: GPUs per node
    dummy,               # bool: use dummy weights for benchmarking
    output_file,         # str: log file path ("auto" for auto-naming)
    verbose,             # int: verbosity level (0, 1, or 2)
    kv_offload,          # bool: offload KV cache to CPU
    quant_bits,          # int: weight quantization bits (4, 8, or 16)
    quant_group_size,    # int: quantization group size
    pin_kv_cache,        # bool: use pinned memory for KV cache
    async_kv_offload,    # bool: use async CUDA copies for KV offload
    loops,               # int: number of benchmark iterations
):
    """Execute text generation with ZeRO-partitioned model and record benchmarks."""
Import
# run_generation is defined in run_model.py and uses:
import torch
import deepspeed
import deepspeed.comm as dist
from deepspeed.accelerator import get_accelerator
from accelerate import init_empty_weights
from timer import timers
from transformers import AutoTokenizer
from utils import (add_model_hooks, cache_bytes, get_filename,
hidden_bytes, meta_to_cpu, model_bytes, write_benchmark_log)
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_name` | `str` | Yes | HuggingFace model identifier (e.g., `"facebook/opt-175b"`) |
| `batch_size` | `int` | Yes | Total batch size (divided by world size for per-GPU batch) |
| `prompt_len` | `int` | Yes | Fixed prompt length in tokens (default: 512) |
| `gen_len` | `int` | Yes | Number of tokens to generate (default: 32) |
| `cpu_offload` | `bool` | Yes | Offload model weights to CPU |
| `disk_offload` | `bool` | Yes | Offload model weights to NVMe |
| `offload_dir` | `str` | Yes | Directory for NVMe/dummy weight storage |
| `num_nodes` | `int` | Yes | Number of compute nodes |
| `num_gpus_per_node` | `int` | Yes | GPUs per node |
| `dummy` | `bool` | Yes | Use dummy weights for benchmarking |
| `output_file` | `str` | Yes | Log file name (`"auto"` for automatic naming) |
| `verbose` | `int` | Yes | 0: silent, 1: metrics only, 2: metrics + generated text |
| `kv_offload` | `bool` | Yes | Enable KV cache CPU offloading |
| `quant_bits` | `int` | Yes | Quantization precision (4, 8, or 16) |
| `quant_group_size` | `int` | Yes | Weights per quantization group |
| `pin_kv_cache` | `bool` | Yes | Allocate KV cache in pinned CPU memory |
| `async_kv_offload` | `bool` | Yes | Use non-blocking copies for KV cache |
| `loops` | `int` | Yes | Number of generation iterations for benchmarking |
Outputs
| Output | Type | Description |
|---|---|---|
| Benchmark log file | File (`.log`) | Structured metrics appended to a file named by model/config |
| Console output | Text | Summary of latency, throughput, and optionally generated text |
| Return value | `None` | Function returns nothing; results are written to file and console |
Timing Architecture
The timing system uses two mechanisms:
GPU-Synchronized Timer
The `timers("generate-forward")` timer from `timer.py` wraps the entire `model.generate()` call with GPU synchronization:
timer = timers("generate-forward")
for _ in range(loops):
    timer.start(sync_func=get_accelerator().synchronize)
    with torch.no_grad():
        set_model_stage(model, "prefill")
        output_ids = model.generate(**input_tokens, **generate_kwargs)
    prefill_timings.append(model.__duration__)
    timer.stop(sync_func=get_accelerator().synchronize)
The `sync_func` ensures all queued CUDA operations complete before each timestamp is recorded; without it, the timer would capture only the asynchronous kernel launch time rather than actual execution time.
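A minimal sketch of such a timer, assuming the `start`/`stop` interface shown in the snippet above (here a no-op callable stands in for CUDA synchronization, since the mechanism is independent of the GPU):

```python
import time

# Minimal sketch of a GPU-synchronized timer with an assumed start/stop
# interface modeled on timers() from timer.py.
class SyncTimer:
    def __init__(self):
        self.costs = []
        self._start = None

    def start(self, sync_func=None):
        if sync_func is not None:
            sync_func()                  # drain queued GPU work first
        self._start = time.perf_counter()

    def stop(self, sync_func=None):
        if sync_func is not None:
            sync_func()                  # wait for kernels to finish
        self.costs.append(time.perf_counter() - self._start)

timer = SyncTimer()
timer.start(sync_func=lambda: None)      # real code passes get_accelerator().synchronize
time.sleep(0.01)                         # stand-in for model.generate()
timer.stop(sync_func=lambda: None)
```

With a real accelerator, passing `get_accelerator().synchronize` as `sync_func` makes both timestamps line up with completed GPU work.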
Model Forward Hooks
Registered by `add_model_hooks(model)`, these hooks measure prefill latency separately:
def start_time_hook(module, input):
    if module.stage == 'prefill':
        torch.cuda.synchronize()
        module.__start_time__ = time.time()

def end_time_hook(module, input, output):
    if module.stage == 'prefill':
        torch.cuda.synchronize()
        module.__duration__ = time.time() - module.__start_time__
        module.stage = "decode"  # switch to decode after first forward pass
The hooks use `torch.cuda.synchronize()` to ensure accurate GPU timing. After the first forward pass (prefill), the stage is switched to `"decode"` so subsequent forward passes (one per generated token) are not timed by the hooks.
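The same hooks can be exercised on a toy module to see the stage switch in action. This is a self-contained demonstration on a single `nn.Linear` rather than the full model, and the `torch.cuda.synchronize()` calls are omitted so it runs on CPU:

```python
import time
import torch
import torch.nn as nn

# Same pre/post forward hooks as above, applied to a toy module
# (CUDA synchronization omitted so this runs on CPU).
def start_time_hook(module, input):
    if module.stage == "prefill":
        module.__start_time__ = time.time()

def end_time_hook(module, input, output):
    if module.stage == "prefill":
        module.__duration__ = time.time() - module.__start_time__
        module.stage = "decode"   # only the first forward pass is timed

model = nn.Linear(4, 4)
model.stage = "prefill"
model.register_forward_pre_hook(start_time_hook)
model.register_forward_hook(end_time_hook)

x = torch.randn(1, 4)
model(x)   # prefill: hooks record __duration__, then switch the stage
model(x)   # decode: hooks are now no-ops
```

After the first call, `model.__duration__` holds the prefill time and `model.stage` is `"decode"`, so the second call is not measured, mirroring the one-prefill-then-many-decodes pattern of autoregressive generation.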
Metric Computation
After the generation loop completes, metrics are computed on rank 0 only (processes with `args.local_rank != 0` return early):
total_latency = costs[-1] # last iteration's total time
prefill_latency = prefill_timings[-1] # last iteration's prefill time
prefill_throughput = batch_size * prompt_len / prefill_latency
decode_latency = total_latency - prefill_latency
decode_throughput = batch_size * (gen_len - 1) / max(decode_latency, 1e-10)
total_throughput = (batch_size * gen_len) / total_latency
gpu_peak_mem = get_accelerator().max_memory_allocated(torch.device("cuda"))
| Metric | Formula | Units |
|---|---|---|
| Prefill throughput | `batch_size * prompt_len / prefill_latency` | tokens/s |
| Decode throughput | `batch_size * (gen_len - 1) / decode_latency` | tokens/s |
| Total throughput | `batch_size * gen_len / total_latency` | tokens/s |
| Prefill latency | Measured by model hooks | seconds |
| Decode latency | `total_latency - prefill_latency` | seconds |
| GPU peak memory | `get_accelerator().max_memory_allocated()` | bytes |
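A worked example of the formulas above, using the configuration from the usage example below and illustrative timings (the 20 s total and 4 s prefill are assumed numbers, not measurements):

```python
# Worked example of the metric formulas with assumed sample timings.
batch_size, prompt_len, gen_len = 8, 512, 32
total_latency = 20.0        # hypothetical last-iteration total (s)
prefill_latency = 4.0       # hypothetical hook-measured prefill (s)

prefill_throughput = batch_size * prompt_len / prefill_latency
decode_latency = total_latency - prefill_latency
decode_throughput = batch_size * (gen_len - 1) / max(decode_latency, 1e-10)
total_throughput = batch_size * gen_len / total_latency

print(prefill_throughput)   # 1024.0 tokens/s
print(decode_throughput)    # 15.5 tokens/s
print(total_throughput)     # 12.8 tokens/s
```

Note that decode throughput counts `gen_len - 1` tokens because the first generated token is produced by the prefill pass.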
Dummy Weight Generation
When `--dummy` is specified, the function creates synthetic model weights for benchmarking without requiring real model downloads:
with init_empty_weights():
    model = ModelClass(config)  # allocates on meta device (zero memory)
model.save_pretrained(
    filename,
    state_dict=meta_to_cpu(model.state_dict(), torch.float16)
)
The `init_empty_weights()` context manager from the accelerate library allocates tensors on the meta device (no actual memory). `meta_to_cpu()` converts these to empty CPU tensors with the correct shapes and FP16 dtype, which are then saved as a pretrained model checkpoint.
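The conversion can be sketched as follows. The behavior of `meta_to_cpu_sketch` is an assumption about what `utils.meta_to_cpu` does (replace each meta tensor with an uninitialized CPU tensor of the same shape in the target dtype), and `torch.device("meta")` is used here in place of accelerate's `init_empty_weights()`:

```python
import torch

# Assumed sketch of meta-to-CPU conversion: swap each meta tensor for an
# uninitialized CPU tensor of the same shape in the target dtype.
def meta_to_cpu_sketch(state_dict, dtype=torch.float16):
    return {name: torch.empty(t.shape, dtype=dtype, device="cpu")
            for name, t in state_dict.items()}

with torch.device("meta"):
    layer = torch.nn.Linear(1024, 1024)   # meta device: no real memory

cpu_state = meta_to_cpu_sketch(layer.state_dict())
```

The resulting tensors contain arbitrary values, which is fine for benchmarking: only shapes and dtypes matter for measuring memory traffic and compute time.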
Usage Example
# Called from __main__ after argument parsing:
run_generation(
    model_name="facebook/opt-175b",
    batch_size=8,
    prompt_len=512,
    gen_len=32,
    cpu_offload=True,
    disk_offload=False,
    offload_dir="/home/user/offload_dir",
    num_nodes=1,
    num_gpus_per_node=1,
    dummy=True,               # benchmark with dummy weights
    output_file="auto",       # auto-generate filename
    verbose=2,                # print metrics and generated text
    kv_offload=True,
    quant_bits=4,
    quant_group_size=64,
    pin_kv_cache=True,
    async_kv_offload=True,
    loops=3,
)
# Output: benchmark log written to
# ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
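The auto-generated name above can be reconstructed with a hypothetical sketch. The actual logic lives in `utils.get_filename` and may differ in detail; the segment order and conditions below are inferred from the example filename only:

```python
# Hypothetical reconstruction of the "auto" log-file naming scheme;
# the real implementation is utils.get_filename and may differ.
def auto_log_name(model_name, batch_size, prompt_len, gen_len,
                  num_nodes, num_gpus_per_node,
                  cpu_offload, kv_offload, quant_bits):
    parts = [
        "ds-" + model_name.split("/")[-1],   # e.g. "opt-175b"
        f"bs{batch_size}",
        f"prompt{prompt_len}",
        f"gen{gen_len}",
        f"n{num_nodes}x{num_gpus_per_node}",
    ]
    if cpu_offload:
        parts.append("cpu")
    if kv_offload:
        parts.append("kv_offload")
    if quant_bits < 16:                      # 16-bit means no weight quant
        parts.append("w_quant")
    return "-".join(parts) + ".log"

print(auto_log_name("facebook/opt-175b", 8, 512, 32, 1, 1,
                    True, True, 4))
# ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
```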