
Implementation:Microsoft DeepSpeedExamples Run Generation

From Leeroopedia


Overview

Concrete tool for executing text generation with ZeRO-partitioned models, with support for KV cache offloading and benchmarking.

Description

The run_generation function is the main entry point for the ZeRO-Inference pipeline. It orchestrates the complete inference workflow: loading configuration and tokenizer, optionally creating dummy weights, initializing the DeepSpeed model, encoding prompts, running timed generation loops, computing performance metrics, and writing benchmark logs.

The function performs the following steps:

  1. Load model configuration and tokenizer via get_model_config() and get_tokenizer().
  2. Create dummy weights (if --dummy flag): Instantiates the model on meta device using init_empty_weights(), converts to CPU tensors via meta_to_cpu(), and saves to disk. This enables benchmarking without downloading real model weights.
  3. Initialize DeepSpeed model via get_ds_model() within a torch.no_grad() context.
  4. Encode prompts: Creates a batch of identical prompts ("Paris is the capital city of") and encodes them to fixed-length tensors with padding (a minimal encoding sketch follows this list).
  5. Configure KV cache offloading (if enabled) via model.set_kv_cache_offload().
  6. Register timing hooks via add_model_hooks() for prefill/decode phase separation.
  7. Run generation loop: Executes model.generate() for the configured number of benchmark iterations, using GPU-synchronized timers.
  8. Compute metrics: Calculates prefill latency/throughput, decode latency/throughput, total latency/throughput, and peak GPU memory.
  9. Write benchmark log via write_benchmark_log() and optionally print generated text.
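
The prompt-encoding step (4) can be illustrated with plain transformers calls. This is a minimal sketch rather than the exact code in run_model.py: the model name, per-GPU batch size, and padding options here are illustrative assumptions.

from transformers import AutoTokenizer

# Sketch of fixed-length prompt encoding (step 4). Model name, batch size,
# and padding options are illustrative assumptions, not the source code.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b", padding_side="left")
prompts = ["Paris is the capital city of"] * 4   # per-GPU batch of identical prompts
input_tokens = tokenizer(prompts, return_tensors="pt",
                         padding="max_length", max_length=512)
# input_tokens["input_ids"].shape -> torch.Size([4, 512])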

Code Reference

Source

Repository: DeepSpeedExamples
File: inference/huggingface/zero_inference/run_model.py
Lines: 173-402

Signature

def run_generation(
    model_name,          # str: HuggingFace model identifier
    batch_size,          # int: total batch size across all GPUs
    prompt_len,          # int: fixed prompt length in tokens
    gen_len,             # int: number of tokens to generate
    cpu_offload,         # bool: offload parameters to CPU
    disk_offload,        # bool: offload parameters to NVMe
    offload_dir,         # str: directory for offload storage
    num_nodes,           # int: number of compute nodes
    num_gpus_per_node,   # int: GPUs per node
    dummy,               # bool: use dummy weights for benchmarking
    output_file,         # str: log file path ("auto" for auto-naming)
    verbose,             # int: verbosity level (0, 1, or 2)
    kv_offload,          # bool: offload KV cache to CPU
    quant_bits,          # int: weight quantization bits (4, 8, or 16)
    quant_group_size,    # int: quantization group size
    pin_kv_cache,        # bool: use pinned memory for KV cache
    async_kv_offload,    # bool: use async CUDA copies for KV offload
    loops,               # int: number of benchmark iterations
):
    """Execute text generation with ZeRO-partitioned model and record benchmarks."""

Import

# run_generation is defined in run_model.py and uses:
import torch
import deepspeed
import deepspeed.comm as dist
from deepspeed.accelerator import get_accelerator
from accelerate import init_empty_weights
from timer import timers
from transformers import AutoTokenizer
from utils import (add_model_hooks, cache_bytes, get_filename,
                   hidden_bytes, meta_to_cpu, model_bytes, write_benchmark_log)

I/O Contract

Inputs

Parameter Type Required Description
model_name str Yes HuggingFace model identifier (e.g., "facebook/opt-175b")
batch_size int Yes Total batch size (divided by world size for per-GPU batch)
prompt_len int Yes Fixed prompt length in tokens (default: 512)
gen_len int Yes Number of tokens to generate (default: 32)
cpu_offload bool Yes Offload model weights to CPU
disk_offload bool Yes Offload model weights to NVMe
offload_dir str Yes Directory for NVMe/dummy weight storage
num_nodes int Yes Number of compute nodes
num_gpus_per_node int Yes GPUs per node
dummy bool Yes Use dummy weights for benchmarking
output_file str Yes Log file name ("auto" for automatic naming)
verbose int Yes 0: silent, 1: metrics only, 2: metrics + generated text
kv_offload bool Yes Enable KV cache CPU offloading
quant_bits int Yes Quantization precision (4, 8, or 16)
quant_group_size int Yes Weights per quantization group
pin_kv_cache bool Yes Allocate KV cache in pinned CPU memory
async_kv_offload bool Yes Use non-blocking copies for KV cache
loops int Yes Number of generation iterations for benchmarking

Outputs

Output Type Description
Benchmark log file File (.log) Structured metrics appended to a file named by model/config
Console output Text Summary of latency, throughput, and optionally generated text
Return value None Function returns nothing; results are written to file and console

Timing Architecture

The timing system uses two mechanisms:

GPU-Synchronized Timer

The timers("generate-forward") timer from timer.py wraps the entire model.generate() call with GPU synchronization:

timer = timers("generate-forward")
for _ in range(loops):
    timer.start(sync_func=get_accelerator().synchronize)
    with torch.no_grad():
        set_model_stage(model, "prefill")
        output_ids = model.generate(**input_tokens, **generate_kwargs)
        prefill_timings.append(model.__duration__)
    timer.stop(sync_func=get_accelerator().synchronize)

The sync_func ensures all queued CUDA operations complete before timestamps are recorded; without synchronization, the timer would capture only the asynchronous kernel launch time rather than the actual execution time.
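
timer.py itself is not reproduced on this page. As a rough illustration of the start/stop(sync_func=...) interface used above, a minimal synchronized timer might look like the following sketch (not the repository's implementation):

import time

class SyncTimer:
    # Minimal sketch of a timer with the start/stop(sync_func=...) interface
    # used above. Illustrative only; timer.py's implementation may differ.
    def __init__(self):
        self.costs = []

    def start(self, sync_func=None):
        if sync_func is not None:
            sync_func()              # drain queued GPU work before timestamping
        self._start = time.time()

    def stop(self, sync_func=None):
        if sync_func is not None:
            sync_func()              # wait for generate()'s kernels to finish
        self.costs.append(time.time() - self._start)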

Model Forward Hooks

Registered by add_model_hooks(model), these hooks measure prefill latency separately:

def start_time_hook(module, input):
    if module.stage == 'prefill':
        torch.cuda.synchronize()
        module.__start_time__ = time.time()

def end_time_hook(module, input, output):
    if module.stage == 'prefill':
        torch.cuda.synchronize()
        module.__duration__ = time.time() - module.__start_time__
        module.stage = "decode"  # switch to decode after first forward pass

The hooks use torch.cuda.synchronize() to ensure accurate GPU timing. After the first forward pass (prefill), the stage is switched to "decode" so subsequent forward passes (one per generated token) are not timed by the hooks.
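
These hooks are attached with the standard PyTorch hook APIs. A minimal sketch of what add_model_hooks and the set_model_stage helper used in the timing loop might look like, building on the hook functions shown above (the exact implementation in utils.py may differ):

def add_model_hooks(model):
    # Sketch: attach the prefill timing hooks shown above via PyTorch's hook APIs.
    model.stage = "prefill"
    model.register_forward_pre_hook(start_time_hook)   # runs before each forward pass
    model.register_forward_hook(end_time_hook)         # runs after each forward pass

def set_model_stage(model, stage):
    # Reset to "prefill" before each benchmark iteration so prefill is re-timed.
    model.stage = stage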

Metric Computation

After the generation loop completes, metrics are computed on rank 0 only (processes with args.local_rank != 0 return early):

total_latency = costs[-1]              # last iteration's total time
prefill_latency = prefill_timings[-1]  # last iteration's prefill time

prefill_throughput = batch_size * prompt_len / prefill_latency
decode_latency = total_latency - prefill_latency
decode_throughput = batch_size * (gen_len - 1) / max(decode_latency, 1e-10)
total_throughput = (batch_size * gen_len) / total_latency
gpu_peak_mem = get_accelerator().max_memory_allocated(torch.device("cuda"))

Metric Formula Units
Prefill throughput batch_size * prompt_len / prefill_latency tokens/s
Decode throughput batch_size * (gen_len - 1) / decode_latency tokens/s
Total throughput batch_size * gen_len / total_latency tokens/s
Prefill latency Measured by model hooks seconds
Decode latency total_latency - prefill_latency seconds
GPU peak memory get_accelerator().max_memory_allocated() bytes
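
As a worked example with assumed numbers (illustrative only, not measured results), suppose batch_size=8, prompt_len=512, gen_len=32, a 2.0 s prefill latency, and a 14.0 s total latency:

# Hypothetical numbers, for illustration only:
batch_size, prompt_len, gen_len = 8, 512, 32
prefill_latency, total_latency = 2.0, 14.0                       # seconds

prefill_throughput = batch_size * prompt_len / prefill_latency   # 2048.0 tokens/s
decode_latency = total_latency - prefill_latency                 # 12.0 s
decode_throughput = batch_size * (gen_len - 1) / decode_latency  # ~20.7 tokens/s
total_throughput = batch_size * gen_len / total_latency          # ~18.3 tokens/s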

Dummy Weight Generation

When --dummy is specified, the function creates synthetic model weights for benchmarking without requiring real model downloads:

with init_empty_weights():
    model = ModelClass(config)  # Allocates on meta device (zero memory)
model.save_pretrained(
    filename,
    state_dict=meta_to_cpu(model.state_dict(), torch.float16)
)

The init_empty_weights() context manager from the accelerate library allocates tensors on the meta device (no actual memory). meta_to_cpu() converts these to empty CPU tensors with the correct shapes and FP16 dtype, which are then saved as a pretrained model checkpoint.
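
meta_to_cpu lives in utils.py and is not reproduced here. Conceptually it maps each meta tensor to a materialized CPU tensor of the same shape; a minimal sketch of that idea follows (the helper name and details below are illustrative, not the utils.py implementation):

import torch

def meta_to_cpu_sketch(state_dict, dtype=torch.float16):
    # Replace each meta tensor with an uninitialized CPU tensor of the same shape.
    # The values are arbitrary, which is acceptable for throughput/memory benchmarks.
    return {name: torch.empty(tensor.shape, dtype=dtype, device="cpu")
            for name, tensor in state_dict.items()}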

Usage Example

# Called from __main__ after argument parsing:
run_generation(
    model_name="facebook/opt-175b",
    batch_size=8,
    prompt_len=512,
    gen_len=32,
    cpu_offload=True,
    disk_offload=False,
    offload_dir="/home/user/offload_dir",
    num_nodes=1,
    num_gpus_per_node=1,
    dummy=True,          # benchmark with dummy weights
    output_file="auto",  # auto-generate filename
    verbose=2,           # print metrics and generated text
    kv_offload=True,
    quant_bits=4,
    quant_group_size=64,
    pin_kv_cache=True,
    async_kv_offload=True,
    loops=3,
)
# Output: benchmark log written to
#   ds-opt-175b-bs8-prompt512-gen32-n1x1-cpu-kv_offload-w_quant.log
