Implementation:Huggingface Transformers Time Generate Warmup

From Leeroopedia
Knowledge Sources
Domains Benchmarking, Performance, JIT Compilation
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete tool for executing warmup iterations of model generation, whose timings are discarded, to stabilize JIT compilation and GPU state before measurement, provided by the HuggingFace Transformers benchmark framework.

Description

The warmup phase of BenchmarkRunner.run_benchmark exercises the full generation pipeline without collecting measurement data. It begins with a single validation call, time_generate(config, warmup=True), that checks whether the configuration executes at all; if the returned end-to-end latency is negative, the configuration is skipped entirely. The warmup loop then calls time_generate(config, warmup=True) config.warmup_iterations times (default: 5). Because warmup=True, the GPUMonitor is never started and no hardware metrics are collected. These iterations give torch.compile time to trace its graphs, let CUDA kernels be cached, and let the memory allocator establish its pools. After warmup completes, the benchmark proceeds to the measurement phase.

Usage

The warmup phase is executed automatically as part of run_benchmark. It is not typically called in isolation. The number of warmup iterations is controlled by BenchmarkConfig.warmup_iterations.

Code Reference

Source Location

  • Repository: transformers
  • File: benchmark_v2/framework/benchmark_runner.py (lines 219-236 for warmup orchestration, lines 254-302 for time_generate)

Signature

def run_benchmark(self, config: BenchmarkConfig, num_tokens_to_profile: int = 0) -> BenchmarkResult | None:
    """Run a single benchmark with the given model ID and config."""
    ...

def time_generate(
    self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[float], str, GPURawMetrics | None]:
    ...

Import

from benchmark_v2.framework.benchmark_runner import BenchmarkRunner

I/O Contract

Inputs

Name   | Type            | Required | Description
config | BenchmarkConfig | Yes      | Benchmark configuration. The warmup_iterations field controls the number of warmup runs.
warmup | bool            | Yes      | Set to True during warmup. Disables GPU monitoring.

Outputs

Name                     | Type        | Description
e2e_latency              | float       | Wall-clock generation time in seconds. A negative value from the validation call signals that the configuration should be skipped.
timestamps               | list[float] | Per-token timestamps relative to generation start (collected but not used during warmup).
shape_and_decoded_output | str         | Output shape and decoded text (collected but not used during warmup).
gpu_metrics              | None        | Always None during warmup since GPU monitoring is disabled.

Internal Behavior

The warmup phase within run_benchmark proceeds as follows:

  1. Validation call: flush_memory() is called to clear GPU state. A single time_generate(config, warmup=True) is executed. If the returned e2e_latency is negative, the method logs a warning and returns None, skipping this configuration.
  2. Warmup loop: Iterates config.warmup_iterations times (with a trange progress bar labeled "Warmup"), calling time_generate(config, warmup=True) each time. Results are discarded.
  3. Transition: After the warmup loop completes, execution proceeds to the measurement phase.
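The three steps above can be sketched as a small, self-contained control-flow model. This is not the framework's code: run_with_warmup and fake_time_generate are hypothetical stand-ins, the stub takes only the warmup flag rather than a config, and the assumption that the measurement phase calls time_generate with warmup=False is inferred from the description rather than confirmed by the source.

```python
from typing import Callable, Optional


def run_with_warmup(
    time_generate: Callable[[bool], float],
    warmup_iterations: int,
    measurement_iterations: int,
) -> Optional[list[float]]:
    """Hypothetical model of the validation -> warmup -> measurement flow."""
    # Validation call: a negative latency means the configuration cannot run.
    if time_generate(True) < 0:
        return None  # skip this configuration
    # Warmup loop: results are intentionally discarded.
    for _ in range(warmup_iterations):
        time_generate(True)
    # Measurement phase (assumed to pass warmup=False): latencies are kept.
    return [time_generate(False) for _ in range(measurement_iterations)]


# Record the warmup flag of every call so the call pattern can be inspected.
calls: list[bool] = []


def fake_time_generate(warmup: bool) -> float:
    """Hypothetical stub that pretends generation took 10 ms."""
    calls.append(warmup)
    return 0.01


latencies = run_with_warmup(
    fake_time_generate, warmup_iterations=5, measurement_iterations=20
)
skipped = run_with_warmup(
    lambda warmup: -1.0, warmup_iterations=5, measurement_iterations=20
)
```

With warmup_iterations=5, the stub is called six times with warmup=True (one validation call plus five warmup iterations) before any measured call, and a configuration whose validation call reports a negative latency yields None.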

Inside time_generate when warmup=True:

  1. GPU monitoring is skipped (gpu_monitor = None).
  2. Generation proceeds normally via model.generate (standard) or model.generate_batch (continuous batching), with a BenchmarkStreamer attached in standard mode.
  3. Timing, token count validation, and memory flushing proceed identically to measurement mode.
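The warmup-dependent branch described above amounts to creating the monitor conditionally while leaving the timing path unchanged. The following is a minimal sketch under that reading; DummyMonitor and timed_call are hypothetical names, and the real GPUMonitor interface is not shown in this page, so its start/stop methods here are assumptions.

```python
import time
from typing import Callable, Optional


class DummyMonitor:
    """Hypothetical stand-in for a GPU monitor; it only tracks start/stop."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> dict:
        self.running = False
        return {"samples": 0}  # placeholder for collected hardware metrics


def timed_call(
    workload: Callable[[], object], warmup: bool
) -> tuple[float, Optional[dict]]:
    # During warmup no monitor is created, so no hardware metrics exist.
    monitor: Optional[DummyMonitor] = None if warmup else DummyMonitor()
    if monitor is not None:
        monitor.start()
    # Timing is identical in both modes.
    start = time.perf_counter()
    workload()
    e2e_latency = time.perf_counter() - start
    metrics = monitor.stop() if monitor is not None else None
    return e2e_latency, metrics


warm_latency, warm_metrics = timed_call(lambda: sum(range(1000)), warmup=True)
meas_latency, meas_metrics = timed_call(lambda: sum(range(1000)), warmup=False)
```

In this sketch the warmup call still returns a latency, but its metrics slot is None, mirroring the gpu_metrics row of the Outputs table.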

Usage Examples

Basic Usage

import logging
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
from benchmark_v2.framework.benchmark_config import BenchmarkConfig

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger=logger)

config = BenchmarkConfig(
    warmup_iterations=5,
    measurement_iterations=20,
    attn_implementation="flex_attention",
    compile_kwargs={"mode": "default"},
)

# Load model
runner.setup_benchmark("meta-llama/Llama-3-8B", config)

# Run benchmark (warmup is automatic)
result = runner.run_benchmark(config)
# One validation call plus 5 warmup iterations run first; the next 20 iterations are measured

Controlling Warmup Iterations

# More warmup for max-autotune mode (autotuning takes more iterations)
config = BenchmarkConfig(
    warmup_iterations=10,
    measurement_iterations=20,
    attn_implementation="sdpa",
    compile_kwargs={"mode": "max-autotune"},
)

Related Pages

Implements Principle
