Implementation:Huggingface Transformers Time Generate Warmup

Knowledge Sources	Transformers
Domains	Benchmarking, Performance, JIT Compilation
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete tool for executing warmup iterations of model generation whose results are discarded, so that JIT compilation and GPU state stabilize before measurement. Provided by the HuggingFace Transformers benchmark framework.

Description

The warmup phase of BenchmarkRunner.run_benchmark calls time_generate(config, warmup=True) repeatedly to exercise the full generation pipeline without collecting measurement data. When warmup=True, GPU monitoring is disabled (the GPUMonitor is not started), so no hardware metrics are collected. The warmup loop is preceded by a single validation call that checks whether the configuration executes successfully; if the returned end-to-end latency is negative, the configuration is skipped entirely. The warmup runs for config.warmup_iterations iterations (default: 5), during which torch.compile graph tracing, CUDA kernel caching, and memory allocator pool establishment occur. After warmup completes, the benchmark proceeds to the measurement phase.
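The effect that warmup is meant to absorb can be illustrated with a toy workload. The sketch below does not use the benchmark framework or torch.compile; `generate`, `timed`, and the 50 ms sleep are hypothetical stand-ins for a generation call that pays a one-time setup cost (graph tracing, kernel caching) on its first invocation.

```python
import time

# Remembers which input shapes have already paid the one-time setup cost.
_compiled = {}


def generate(prompt_len: int) -> int:
    """Toy workload: the first call per shape pays a simulated compile cost."""
    if prompt_len not in _compiled:
        time.sleep(0.05)  # stand-in for graph tracing / kernel caching
        _compiled[prompt_len] = True
    return prompt_len * 2


def timed(fn, *args) -> float:
    """Return the wall-clock duration of a single call, in seconds."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


cold = timed(generate, 128)                      # pays the one-time cost
warm = [timed(generate, 128) for _ in range(5)]  # steady-state calls
print(f"cold: {cold:.4f}s, slowest warm: {max(warm):.6f}s")
```

If the cold call were included in the measured samples, it would dominate the mean latency, which is why such calls are run first and their results thrown away.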

Usage

The warmup phase is executed automatically as part of run_benchmark. It is not typically called in isolation. The number of warmup iterations is controlled by BenchmarkConfig.warmup_iterations.

Code Reference

Source Location

Repository: transformers
File: benchmark_v2/framework/benchmark_runner.py (lines 219-236 for warmup orchestration, lines 254-302 for time_generate)

Signature

def run_benchmark(self, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> BenchmarkResult | None:
    """Run a single benchmark with the given model ID and config."""
    ...

def time_generate(
    self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[float], str, GPURawMetrics | None]:
    ...

Import

from benchmark_v2.framework.benchmark_runner import BenchmarkRunner

I/O Contract

Inputs

Name	Type	Required	Description
config	`BenchmarkConfig`	Yes	Benchmark configuration. The `warmup_iterations` field controls the number of warmup runs.
warmup	`bool`	Yes	Set to `True` during warmup. Disables GPU monitoring.

Outputs

Name	Type	Description
e2e_latency	`float`	Wall-clock generation time in seconds. A negative value from the validation call signals that the configuration should be skipped.
timestamps	`list[float]`	Per-token timestamps relative to generation start (collected but not used during warmup).
shape_and_decoded_output	`str`	Output shape and decoded text (collected but not used during warmup).
gpu_metrics	`GPURawMetrics | None`	Always `None` during warmup since GPU monitoring is disabled.

Internal Behavior

The warmup phase within run_benchmark proceeds as follows:

Validation call: flush_memory() is called to clear GPU state. A single time_generate(config, warmup=True) is executed. If the returned e2e_latency is negative, the method logs a warning and returns None, skipping this configuration.
Warmup loop: Iterates config.warmup_iterations times (with a trange progress bar labeled "Warmup"), calling time_generate(config, warmup=True) each time. Results are discarded.
Transition: After the warmup loop completes, execution proceeds to the measurement phase.
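The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from benchmark_runner.py: `SketchConfig`, `SketchRunner`, `run_warmup`, and the stubbed `flush_memory` are hypothetical stand-ins, the fake `time_generate` returns fixed values, and a plain `range` replaces the `trange` progress bar.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("warmup_sketch")


def flush_memory() -> None:
    """Stub: the real framework clears GPU state at this point."""


@dataclass
class SketchConfig:
    # Hypothetical stand-in for BenchmarkConfig; only this field is modeled.
    warmup_iterations: int = 5


class SketchRunner:
    """Hypothetical stand-in for BenchmarkRunner with a fake time_generate."""

    def __init__(self, fail: bool = False):
        self.fail = fail
        self.calls = 0

    def time_generate(self, config, warmup: bool):
        self.calls += 1
        e2e_latency = -1.0 if self.fail else 0.01
        # (e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)
        return e2e_latency, [], "", None

    def run_warmup(self, config) -> bool:
        """Return True if warmup finished, False if the config is skipped."""
        # Step 1: validation call on a clean memory state.
        flush_memory()
        e2e_latency, _, _, _ = self.time_generate(config, warmup=True)
        if e2e_latency < 0:
            logger.warning("Validation call failed; skipping this configuration")
            return False
        # Step 2: warmup loop; every result is discarded.
        for _ in range(config.warmup_iterations):
            self.time_generate(config, warmup=True)
        # Step 3: the caller would now move on to the measurement phase.
        return True


runner = SketchRunner()
completed = runner.run_warmup(SketchConfig(warmup_iterations=5))
```

With the default of 5 warmup iterations, the sketch issues 6 generation calls before any measurement would begin: one validation call plus five warmup calls.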

Inside time_generate when warmup=True:

GPU monitoring is skipped (gpu_monitor = None).
Generation proceeds normally via model.generate (standard) or model.generate_batch (continuous batching), with a BenchmarkStreamer attached in standard mode.
Timing, token count validation, and memory flushing proceed identically to measurement mode.
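How the warmup flag gates monitoring while leaving timing untouched can be sketched like this. The names `SketchGPUMonitor`, `stop_and_collect`, and `time_generate_sketch` are assumptions made for illustration; the real GPUMonitor interface and the real return tuple are not reproduced here.

```python
import time


class SketchGPUMonitor:
    """Hypothetical stand-in for GPUMonitor; it records nothing real."""

    def __init__(self):
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop_and_collect(self) -> dict:
        self.running = False
        return {"samples": 0}


def time_generate_sketch(generate_fn, warmup: bool):
    """Time one generation call; skip GPU monitoring when warmup is True."""
    gpu_monitor = None if warmup else SketchGPUMonitor()
    if gpu_monitor is not None:
        gpu_monitor.start()

    # Timing runs the same way in warmup and measurement mode.
    start = time.perf_counter()
    output = generate_fn()
    e2e_latency = time.perf_counter() - start

    gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor is not None else None
    return e2e_latency, output, gpu_metrics


warm_latency, warm_output, warm_metrics = time_generate_sketch(lambda: "tokens", warmup=True)
meas_latency, meas_output, meas_metrics = time_generate_sketch(lambda: "tokens", warmup=False)
```

Only the monitor differs between the two calls: the warmup call returns `None` for its metrics, while the measurement call returns whatever the monitor collected.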

Usage Examples

Basic Usage

import logging
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
from benchmark_v2.framework.benchmark_config import BenchmarkConfig

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger=logger)

config = BenchmarkConfig(
    warmup_iterations=5,
    measurement_iterations=20,
    attn_implementation="flex_attention",
    compile_kwargs={"mode": "default"},
)

# Load model
runner.setup_benchmark("meta-llama/Llama-3-8B", config)

# Run benchmark (warmup is automatic)
result = runner.run_benchmark(config)
# 1 validation call + 5 warmup iterations run first; the next 20 iterations are measured
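The call count in the example above can be written out as simple arithmetic. The helper `expected_generate_calls` is hypothetical, and it assumes that each measurement iteration issues exactly one generation call and that the default num_tokens_to_profile=0 adds none.

```python
def expected_generate_calls(warmup_iterations: int, measurement_iterations: int) -> dict:
    """Break down generation calls by phase (assumes one call per iteration)."""
    validation = 1  # the single validation call that precedes the warmup loop
    return {
        "validation": validation,
        "warmup": warmup_iterations,
        "measurement": measurement_iterations,
        "total": validation + warmup_iterations + measurement_iterations,
    }


calls = expected_generate_calls(warmup_iterations=5, measurement_iterations=20)
print(calls)
```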

Controlling Warmup Iterations

# More warmup for max-autotune mode (autotuning may need more iterations to settle)
config = BenchmarkConfig(
    warmup_iterations=10,
    measurement_iterations=20,
    attn_implementation="sdpa",
    compile_kwargs={"mode": "max-autotune"},
)

Related Pages

Implements Principle

Principle:Huggingface_Transformers_JIT_Warmup
