Implementation:Huggingface Transformers Time Generate Warmup
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance, JIT Compilation |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for executing untimed warmup iterations of model generation to stabilize JIT compilation and GPU state before measurement, provided by the HuggingFace Transformers benchmark framework.
Description
The warmup phase of BenchmarkRunner.run_benchmark calls time_generate(config, warmup=True) repeatedly to exercise the full generation pipeline without collecting measurement data. When warmup=True, GPU monitoring is disabled (the GPUMonitor is not started), so no hardware metrics are collected. The warmup loop is preceded by a single validation call that checks whether the configuration executes successfully; if the returned end-to-end latency is negative, the configuration is skipped entirely. The warmup runs for config.warmup_iterations iterations (default: 5), during which torch.compile graph tracing, CUDA kernel caching, and memory allocator pool establishment occur. After warmup completes, the benchmark proceeds to the measurement phase.
Usage
The warmup phase is executed automatically as part of run_benchmark. It is not typically called in isolation. The number of warmup iterations is controlled by BenchmarkConfig.warmup_iterations.
Code Reference
Source Location
- Repository: transformers
- File: benchmark_v2/framework/benchmark_runner.py (lines 219-236 for warmup orchestration, lines 254-302 for time_generate)
Signature
def run_benchmark(self, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> BenchmarkResult | None:
    """Run a single benchmark with the given model ID and config."""
    ...

def time_generate(
    self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[float], str, GPURawMetrics | None]:
    ...
Import
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | BenchmarkConfig | Yes | Benchmark configuration. The warmup_iterations field controls the number of warmup runs. |
| warmup | bool | Yes | Set to True during warmup. Disables GPU monitoring. |
Outputs
| Name | Type | Description |
|---|---|---|
| e2e_latency | float | Wall-clock generation time in seconds. A negative value from the validation call signals that the configuration should be skipped. |
| timestamps | list[float] | Per-token timestamps relative to generation start (collected but not used during warmup). |
| shape_and_decoded_output | str | Output shape and decoded text (collected but not used during warmup). |
| gpu_metrics | None | Always None during warmup since GPU monitoring is disabled. |
Internal Behavior
The warmup phase within run_benchmark proceeds as follows:
- Validation call: `flush_memory()` is called to clear GPU state. A single `time_generate(config, warmup=True)` is executed. If the returned `e2e_latency` is negative, the method logs a warning and returns `None`, skipping this configuration.
- Warmup loop: Iterates `config.warmup_iterations` times (with a `trange` progress bar labeled "Warmup"), calling `time_generate(config, warmup=True)` each time. Results are discarded.
- Transition: After the warmup loop completes, execution proceeds to the measurement phase.
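The control flow above can be sketched as follows. This is a minimal, self-contained illustration rather than the framework's code: `run_warmup` and `fake_time_generate` are hypothetical stand-ins, and only the ordering (validation call, negative-latency skip, discarded warmup loop) is taken from the description above. Memory flushing and the progress bar are omitted.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("warmup_sketch")

# Hypothetical alias mirroring the 4-tuple returned by time_generate:
# (e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)
GenerateResult = tuple[float, list[float], str, Optional[object]]


def run_warmup(time_generate: Callable[[bool], GenerateResult], warmup_iterations: int) -> bool:
    """Return True if warmup completed, False if the configuration should be skipped."""
    # Validation call: a negative latency signals that this config cannot run.
    e2e_latency, _, _, _ = time_generate(True)
    if e2e_latency < 0:
        logger.warning("Skipping config: validation call returned %s", e2e_latency)
        return False

    # Warmup loop: exercise the pipeline and discard the results.
    for _ in range(warmup_iterations):
        time_generate(True)
    return True


calls: list[bool] = []


def fake_time_generate(warmup: bool) -> GenerateResult:
    # Hypothetical stub that records each call instead of running a model.
    calls.append(warmup)
    return (0.01, [0.0, 0.005, 0.01], "(1, 3) | stub output", None)


completed = run_warmup(fake_time_generate, warmup_iterations=5)
```

With `warmup_iterations=5`, the stub is invoked six times in total: once for validation and five times in the loop.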
Inside time_generate when warmup=True:
- GPU monitoring is skipped (`gpu_monitor = None`).
- Generation proceeds normally via `model.generate` (standard) or `model.generate_batch` (continuous batching), with a `BenchmarkStreamer` attached in standard mode.
- Timing, token count validation, and memory flushing proceed identically to measurement mode.
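One way to picture the effect of the warmup flag is the sketch below. It assumes a simplified shape for the function: `StubMonitor`, `timed_generate`, and `fake_generate` are hypothetical names introduced for illustration, and the real `time_generate` also handles streaming, token count validation, and memory flushing, which are left out here.

```python
import time
from typing import Callable, Optional


class StubMonitor:
    """Hypothetical stand-in for a GPU monitor; it only records that it ran."""

    def __init__(self) -> None:
        self.started = False

    def start(self) -> None:
        self.started = True

    def stop_and_collect(self) -> dict:
        return {"samples": 0}


def timed_generate(generate: Callable[[], list[int]], warmup: bool):
    # During warmup no monitor is created, so no hardware metrics exist.
    monitor: Optional[StubMonitor] = None if warmup else StubMonitor()
    if monitor is not None:
        monitor.start()

    # Timing is identical in both modes.
    start = time.perf_counter()
    tokens = generate()
    e2e_latency = time.perf_counter() - start

    gpu_metrics = monitor.stop_and_collect() if monitor is not None else None
    return e2e_latency, tokens, gpu_metrics


def fake_generate() -> list[int]:
    # Hypothetical stub standing in for a model's generate call.
    return [1, 2, 3]


warm = timed_generate(fake_generate, warmup=True)
measured = timed_generate(fake_generate, warmup=False)
```

In this sketch the only difference between the two calls is the last element of the returned tuple: `None` for the warmup call, a metrics object for the measured one.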
Usage Examples
Basic Usage
import logging

from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
from benchmark_v2.framework.benchmark_config import BenchmarkConfig

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger=logger)

config = BenchmarkConfig(
    warmup_iterations=5,
    measurement_iterations=20,
    attn_implementation="flex_attention",
    compile_kwargs={"mode": "default"},
)

# Load model
runner.setup_benchmark("meta-llama/Llama-3-8B", config)

# Run benchmark (warmup is automatic)
result = runner.run_benchmark(config)
# One validation call plus 5 warmup iterations run first and are discarded;
# the next 20 iterations are measured
Controlling Warmup Iterations
# More warmup for max-autotune mode (autotuning takes more iterations)
config = BenchmarkConfig(
    warmup_iterations=10,
    measurement_iterations=20,
    attn_implementation="sdpa",
    compile_kwargs={"mode": "max-autotune"},
)
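When choosing these values, it can help to count how many generation calls a configuration implies. The helper below is a hypothetical illustration, not part of the framework. It assumes one validation call before the warmup loop, one call per measurement iteration, and no profiling pass.

```python
def generation_call_budget(warmup_iterations: int, measurement_iterations: int) -> dict[str, int]:
    """Count generation calls per phase (hypothetical helper, not part of the framework)."""
    validation_calls = 1  # single validation call that precedes the warmup loop
    return {
        "validation": validation_calls,
        "warmup": warmup_iterations,
        "measurement": measurement_iterations,
        "total": validation_calls + warmup_iterations + measurement_iterations,
    }


budget = generation_call_budget(warmup_iterations=10, measurement_iterations=20)
```

Under these assumptions, only 20 of the 31 calls contribute to reported measurements, so raising `warmup_iterations` lengthens the run without adding data points.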