Implementation: Microsoft DeepSpeedExamples Run Generation
Overview
Concrete tool for executing text generation with ZeRO-partitioned models including KV cache offloading and benchmarking.
Description
The run_generation function is the main entry point for the ZeRO-Inference pipeline. It orchestrates the complete inference workflow: loading configuration and tokenizer, optionally creating dummy weights, initializing the DeepSpeed model, encoding prompts, running timed generation loops, computing performance metrics, and writing benchmark logs.
The function performs the following steps:
- Load model configuration and tokenizer via `get_model_config()` and `get_tokenizer()`.
- Create dummy weights (if the `--dummy` flag is set): instantiates the model on the meta device using `init_empty_weights()`, converts it to CPU tensors via `meta_to_cpu()`, and saves it to disk. This enables benchmarking without downloading real model weights.
- Initialize the DeepSpeed model via `get_ds_model()` within a `torch.no_grad()` context.
- Encode prompts: creates a batch of identical prompts (`"Paris is the capital city of"`) and encodes them to fixed-length tensors with padding.
- Configure KV cache offloading (if enabled) via `model.set_kv_cache_offload()`.
- Register timing hooks via `add_model_hooks()` for prefill/decode phase separation.
- Run the generation loop: executes `model.generate()` for the configured number of benchmark iterations, using GPU-synchronized timers.
- Compute metrics: calculates prefill latency/throughput, decode latency/throughput, total latency/throughput, and peak GPU memory.
- Write the benchmark log via `write_benchmark_log()` and optionally print the generated text.
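The steps above can be sketched end to end. Everything below is a simplified stand-in: the helpers are local stubs (not the real `get_model_config()`/`get_ds_model()` from the repository), and the stub "model" emits token ids instead of running inference.

```python
import time

# Simplified end-to-end sketch of the workflow; all helpers are local
# stubs standing in for the real get_model_config()/get_ds_model().
def get_model_config(name):
    return {"name": name}                     # stub config loader

def get_ds_model(config):
    return lambda n: list(range(n))           # stub model: n "tokens" out

def run_generation_sketch(model_name, batch_size, gen_len, loops):
    config = get_model_config(model_name)     # load configuration
    model = get_ds_model(config)              # initialize (stub) model
    prompts = ["Paris is the capital city of"] * batch_size
    costs = []
    for _ in range(loops):                    # timed generation loop
        start = time.perf_counter()
        output_ids = [model(gen_len) for _ in prompts]
        costs.append(time.perf_counter() - start)
    total_latency = max(costs[-1], 1e-10)     # last iteration's time
    throughput = batch_size * gen_len / total_latency
    return output_ids, throughput
```

The real function additionally handles dummy-weight creation, KV-cache offload configuration, per-phase timing hooks, and log writing, as described in the sections below.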
Code Reference
Source
| Repository | File | Lines |
|---|---|---|
| DeepSpeedExamples | inference/huggingface/zero_inference/run_model.py | 173-402 |
Signature
def run_generation(
    model_name,          # str: HuggingFace model identifier
    batch_size,          # int: total batch size across all GPUs
    prompt_len,          # int: fixed prompt length in tokens
    gen_len,             # int: number of tokens to generate
    cpu_offload,         # bool: offload parameters to CPU
    disk_offload,        # bool: offload parameters to NVMe
    offload_dir,         # str: directory for offload storage
    num_nodes,           # int: number of compute nodes
    num_gpus_per_node,   # int: GPUs per node
    dummy,               # bool: use dummy weights for benchmarking
    output_file,         # str: log file path ("auto" for auto-naming)
    verbose,             # int: verbosity level (0, 1, or 2)
    kv_offload,          # bool: offload KV cache to CPU
    quant_bits,          # int: weight quantization bits (4, 8, or 16)
    quant_group_size,    # int: quantization group size
    pin_kv_cache,        # bool: use pinned memory for KV cache
    async_kv_offload,    # bool: use async CUDA copies for KV offload
    loops,               # int: number of benchmark iterations
):
    """Execute text generation with ZeRO-partitioned model and record benchmarks."""
Import
# run_generation is defined in run_model.py and uses:
import torch
import deepspeed
import deepspeed.comm as dist
from deepspeed.accelerator import get_accelerator
from accelerate import init_empty_weights
from timer import timers
from transformers import AutoTokenizer
from utils import (add_model_hooks, cache_bytes, get_filename,
hidden_bytes, meta_to_cpu, model_bytes, write_benchmark_log)
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `model_name` | `str` | Yes | HuggingFace model identifier (e.g., `"facebook/opt-175b"`) |
| `batch_size` | `int` | Yes | Total batch size (divided by world size for per-GPU batch) |
| `prompt_len` | `int` | Yes | Fixed prompt length in tokens (default: 512) |
| `gen_len` | `int` | Yes | Number of tokens to generate (default: 32) |
| `cpu_offload` | `bool` | Yes | Offload model weights to CPU |
| `disk_offload` | `bool` | Yes | Offload model weights to NVMe |
| `offload_dir` | `str` | Yes | Directory for NVMe/dummy weight storage |
| `num_nodes` | `int` | Yes | Number of compute nodes |
| `num_gpus_per_node` | `int` | Yes | GPUs per node |
| `dummy` | `bool` | Yes | Use dummy weights for benchmarking |
| `output_file` | `str` | Yes | Log file name (`"auto"` for automatic naming) |
| `verbose` | `int` | Yes | 0: silent, 1: metrics only, 2: metrics + generated text |
| `kv_offload` | `bool` | Yes | Enable KV cache CPU offloading |
| `quant_bits` | `int` | Yes | Quantization precision (4, 8, or 16) |
| `quant_group_size` | `int` | Yes | Weights per quantization group |
| `pin_kv_cache` | `bool` | Yes | Allocate KV cache in pinned CPU memory |
| `async_kv_offload` | `bool` | Yes | Use non-blocking copies for KV cache |
| `loops` | `int` | Yes | Number of generation iterations for benchmarking |
Outputs
| Output | Type | Description |
|---|---|---|
| Benchmark log file | File (`.log`) | Structured metrics appended to a file named by model/config |
| Console output | Text | Summary of latency, throughput, and optionally generated text |
| Return value | `None` | Function returns nothing; results are written to file and console |
Timing Architecture
The timing system uses two mechanisms:
GPU-Synchronized Timer
The `timers("generate-forward")` timer from `timer.py` wraps the entire `model.generate()` call with GPU synchronization:
timer = timers("generate-forward")
for _ in range(loops):
    timer.start(sync_func=get_accelerator().synchronize)
    with torch.no_grad():
        set_model_stage(model, "prefill")
        output_ids = model.generate(**input_tokens, **generate_kwargs)
    prefill_timings.append(model.__duration__)
    timer.stop(sync_func=get_accelerator().synchronize)
The `sync_func` ensures all queued CUDA operations complete before each timestamp is recorded; without it, the timer would capture only the asynchronous kernel launch time rather than actual execution time.
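A minimal sketch of such a timer, assuming the `start`/`stop` interface shown in the snippet above (here a no-op callable stands in for CUDA synchronization, since the mechanism is independent of the GPU):

```python
import time

# Minimal sketch of a GPU-synchronized timer with an assumed start/stop
# interface modeled on timers() from timer.py.
class SyncTimer:
    def __init__(self):
        self.costs = []
        self._start = None

    def start(self, sync_func=None):
        if sync_func is not None:
            sync_func()                  # drain queued GPU work first
        self._start = time.perf_counter()

    def stop(self, sync_func=None):
        if sync_func is not None:
            sync_func()                  # wait for kernels to finish
        self.costs.append(time.perf_counter() - self._start)

timer = SyncTimer()
timer.start(sync_func=lambda: None)      # real code passes get_accelerator().synchronize
time.sleep(0.01)                         # stand-in for model.generate()
timer.stop(sync_func=lambda: None)
```

With a real accelerator, passing `get_accelerator().synchronize` as `sync_func` makes both timestamps line up with completed GPU work.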
Model Forward Hooks
Registered by `add_model_hooks(model)`, these hooks measure prefill latency separately:
def start_time_hook(module, input):
    if module.stage == 'prefill':
        torch.cuda.synchronize()
        module.__start_time__ = time.time()

def end_time_hook(module, input, output):
    if module.stage == 'prefill':
        torch.cuda.synchronize()
        module.__duration__ = time.time() - module.__start_time__
        module.stage = "decode"  # switch to decode after first forward pass
The hooks use `torch.cuda.synchronize()` to ensure accurate GPU timing. After the first forward pass (prefill), the stage is switched to `"decode"` so subsequent forward passes (one per generated token) are not timed by the hooks.
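The same hooks can be exercised on a toy module to see the stage switch in action. This is a self-contained demonstration on a single `nn.Linear` rather than the full model, and the `torch.cuda.synchronize()` calls are omitted so it runs on CPU:

```python
import time
import torch
import torch.nn as nn

# Same pre/post forward hooks as above, applied to a toy module
# (CUDA synchronization omitted so this runs on CPU).
def start_time_hook(module, input):
    if module.stage == "prefill":
        module.__start_time__ = time.time()

def end_time_hook(module, input, output):
    if module.stage == "prefill":
        module.__duration__ = time.time() - module.__start_time__
        module.stage = "decode"   # only the first forward pass is timed

model = nn.Linear(4, 4)
model.stage = "prefill"
model.register_forward_pre_hook(start_time_hook)
model.register_forward_hook(end_time_hook)

x = torch.randn(1, 4)
model(x)   # prefill: hooks record __duration__, then switch the stage
model(x)   # decode: hooks are now no-ops
```

After the first call, `model.__duration__` holds the prefill time and `model.stage` is `"decode"`, so the second call is not measured, mirroring the one-prefill-then-many-decodes pattern of autoregressive generation.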
Metric Computation
After the generation loop completes, metrics are computed on rank 0 only (processes with `args.local_rank != 0` return early):
total_latency = costs[-1] # last iteration's total time
prefill_latency = prefill_timings[-1] # last iteration's prefill time
prefill_throughput = batch_size * prompt_len / prefill_latency
decode_latency = total_latency - prefill_latency
decode_throughput = batch_size * (gen_len - 1) / max(decode_latency, 1e-10)
total_throughput = (batch_size * gen_len) / total_latency
gpu_peak_mem = get_accelerator().max_memory_allocated(torch.device("cuda"))
| Metric | Formula | Units |
|---|---|---|
| Prefill throughput | `batch_size * prompt_len / prefill_latency` | tokens/s |
| Decode throughput | `batch_size * (gen_len - 1) / decode_latency` | tokens/s |
| Total throughput | `batch_size * gen_len / total_latency` | tokens/s |
| Prefill latency | Measured by model hooks | seconds |
| Decode latency | `total_latency - prefill_latency` | seconds |
| GPU peak memory | `get_accelerator().max_memory_allocated()` | bytes |
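A worked example of the formulas above, using the configuration from the usage example below and illustrative timings (the 20 s total and 4 s prefill are assumed numbers, not measurements):

```python
# Worked example of the metric formulas with assumed sample timings.
batch_size, prompt_len, gen_len = 8, 512, 32
total_latency = 20.0        # hypothetical last-iteration total (s)
prefill_latency = 4.0       # hypothetical hook-measured prefill (s)

prefill_throughput = batch_size * prompt_len / prefill_latency
decode_latency = total_latency - prefill_latency
decode_throughput = batch_size * (gen_len - 1) / max(decode_latency, 1e-10)
total_throughput = batch_size * gen_len / total_latency

print(prefill_throughput)   # 1024.0 tokens/s
print(decode_throughput)    # 15.5 tokens/s
print(total_throughput)     # 12.8 tokens/s
```

Note that decode throughput counts `gen_len - 1` tokens because the first generated token is produced by the prefill pass.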
Dummy Weight Generation
When `--dummy` is specified, the function creates synthetic model weights for benchmarking without requiring real model downloads:
with init_empty_weights():
    model = ModelClass(config)  # allocates on meta device (zero memory)
model.save_pretrained(
    filename,
    state_dict=meta_to_cpu(model.state_dict(), torch.float16)
)
The `init_empty_weights()` context manager from the accelerate library allocates tensors on the meta device (no actual memory). `meta_to_cpu()` converts these to empty CPU tensors with the correct shapes and FP16 dtype, which are then saved as a pretrained model checkpoint.
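The conversion can be sketched as follows. The behavior of `meta_to_cpu_sketch` is an assumption about what `utils.meta_to_cpu` does (replace each meta tensor with an uninitialized CPU tensor of the same shape in the target dtype), and `torch.device("meta")` is used here in place of accelerate's `init_empty_weights()`:

```python
import torch

# Assumed sketch of meta-to-CPU conversion: swap each meta tensor for an
# uninitialized CPU tensor of the same shape in the target dtype.
def meta_to_cpu_sketch(state_dict, dtype=torch.float16):
    return {name: torch.empty(t.shape, dtype=dtype, device="cpu")
            for name, t in state_dict.items()}

with torch.device("meta"):
    layer = torch.nn.Linear(1024, 1024)   # meta device: no real memory

cpu_state = meta_to_cpu_sketch(layer.state_dict())
```

The resulting tensors contain arbitrary values, which is fine for benchmarking: only shapes and dtypes matter for measuring memory traffic and compute time.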
Usage Example
# Called from __main__ after argument parsing:
run_generation(
    model_name="facebook/opt-175b",
    batch_size=8,
    prompt_len=512,
    gen_len=32,
    cpu_offload=True,
    disk_offload=False,
    offload_dir="/home/user/offload_dir",
    num_nodes=1,
    num_gpus_per_node=1,
    dummy=True,               # benchmark with dummy weights
    output_file="auto",       # auto-generate filename
    verbose=2,                # print metrics and generated text
    kv_offload=True,
    quant_bits=4,
    quant_group_size=64,
    pin_kv_cache=True,
    async_kv_offload=True,
    loops=3,
)
# Output: benchmark log written to
# ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
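The auto-generated name above can be reconstructed with a hypothetical sketch. The actual logic lives in `utils.get_filename` and may differ in detail; the segment order and conditions below are inferred from the example filename only:

```python
# Hypothetical reconstruction of the "auto" log-file naming scheme;
# the real implementation is utils.get_filename and may differ.
def auto_log_name(model_name, batch_size, prompt_len, gen_len,
                  num_nodes, num_gpus_per_node,
                  cpu_offload, kv_offload, quant_bits):
    parts = [
        "ds-" + model_name.split("/")[-1],   # e.g. "opt-175b"
        f"bs{batch_size}",
        f"prompt{prompt_len}",
        f"gen{gen_len}",
        f"n{num_nodes}x{num_gpus_per_node}",
    ]
    if cpu_offload:
        parts.append("cpu")
    if kv_offload:
        parts.append("kv_offload")
    if quant_bits < 16:                      # 16-bit means no weight quant
        parts.append("w_quant")
    return "-".join(parts) + ".log"

print(auto_log_name("facebook/opt-175b", 8, 512, 32, 1, 1,
                    True, True, 4))
# ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
```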