
Implementation:Huggingface Transformers Time Generate Measurement

From Leeroopedia
Domains Benchmarking, Performance, Profiling
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete tool for executing timed inference measurement iterations with per-token timestamping and concurrent GPU monitoring, provided by the HuggingFace Transformers benchmark framework.

Description

The measurement phase of BenchmarkRunner.run_benchmark calls time_generate(config, warmup=False) for config.measurement_iterations iterations, collecting end-to-end latency, per-token timestamps (via BenchmarkStreamer), decoded outputs, and GPU hardware metrics (via GPUMonitor) on each iteration. Results are accumulated into a BenchmarkResult object. The BenchmarkStreamer is a custom BaseStreamer subclass that records a time.perf_counter() timestamp in its put() method each time a token is generated. The GPUMonitor runs a separate process that samples GPU utilization and memory at configurable intervals (default: 50ms) using platform-specific backends (pynvml for NVIDIA, amdsmi for AMD, xpu-smi for Intel).
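The timestamping pattern described above can be sketched in a few lines. This is not the actual Transformers class; `BaseStreamer` is replaced with a plain class, and the simulated `put()` calls stand in for `model.generate()` pushing tokens:

```python
import time

class TimestampStreamer:
    """Minimal sketch of the BenchmarkStreamer pattern: record a
    time.perf_counter() timestamp each time generate() pushes tokens."""

    def __init__(self):
        self.timestamps = []

    def put(self, value):
        # Called by generate() for each chunk of tokens. The first call
        # corresponds to the prompt tokens, later calls to new tokens.
        self.timestamps.append(time.perf_counter())

    def end(self):
        # Called once when generation finishes (sketch: record one more mark).
        self.timestamps.append(time.perf_counter())

# Simulated use: three put() calls as if three tokens were generated.
streamer = TimestampStreamer()
for _ in range(3):
    streamer.put(None)
streamer.end()
print(len(streamer.timestamps))  # 4
```

Because each timestamp is recorded inside `put()`, the gaps between consecutive entries approximate per-token latency without instrumenting the model itself.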

Usage

The measurement phase is executed automatically as part of run_benchmark after warmup completes. The number of measurement iterations is controlled by BenchmarkConfig.measurement_iterations. GPU monitoring is controlled by BenchmarkConfig.gpu_monitoring.
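The sampling pattern used by the GPU monitor can be sketched as follows. The real `GPUMonitor` spawns a separate process and queries pynvml / amdsmi / xpu-smi; this portable sketch uses a thread and stub samples so it runs without a GPU:

```python
import threading
import time

class SketchGPUMonitor:
    """Sketch of the GPUMonitor start()/stop_and_collect() pattern."""

    def __init__(self, sample_interval_sec=0.05):
        self.interval = sample_interval_sec
        self._stop = threading.Event()
        self._samples = []

    def _sample_loop(self):
        # wait() doubles as the sampling interval and the stop check.
        while not self._stop.wait(self.interval):
            # Real code: query utilization % and memory from the GPU backend.
            self._samples.append(0.0)

    def start(self):
        self._thread = threading.Thread(target=self._sample_loop)
        self._thread.start()

    def stop_and_collect(self):
        self._stop.set()
        self._thread.join()
        return self._samples

monitor = SketchGPUMonitor(sample_interval_sec=0.01)
monitor.start()
time.sleep(0.05)          # generation would run here
samples = monitor.stop_and_collect()
print(len(samples) >= 1)  # True
```

The real implementation isolates sampling in a child process so that GPU polling never contends with the Python interpreter running generation.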

Code Reference

Source Location

  • Repository: transformers
  • Files:
    • benchmark_v2/framework/benchmark_runner.py (lines 112-136 for BenchmarkStreamer, lines 238-302 for measurement loop and time_generate)
    • benchmark_v2/framework/hardware_metrics.py (lines 156-325 for GPUMonitor)

Signature

class BenchmarkStreamer(BaseStreamer):
    def __init__(self, **kwargs) -> None:
        ...
    def put(self, value):
        ...
    def end(self):
        ...

class GPUMonitor:
    def __init__(self, sample_interval_sec: float = 0.05, logger: Logger | None = None):
        ...
    def start(self):
        ...
    def stop_and_collect(self) -> GPURawMetrics:
        ...

def time_generate(
    self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[float], str, GPURawMetrics | None]:
    ...

Import

from benchmark_v2.framework.benchmark_runner import BenchmarkRunner, BenchmarkStreamer
from benchmark_v2.framework.hardware_metrics import GPUMonitor

I/O Contract

Inputs (time_generate in measurement mode)

  • config (BenchmarkConfig, required): Benchmark configuration. Controls GPU monitoring, batching mode, and token generation count.
  • warmup (bool, required): Set to False for measurement. When False, GPU monitoring is enabled if config.gpu_monitoring is True.

Outputs (time_generate)

  • e2e_latency (float): Wall-clock generation time in seconds, measured via time.perf_counter().
  • timestamps (list[list[float]]): Per-batch-element lists of per-token timestamps (seconds relative to generation start).
  • shape_and_decoded_output (str): String containing the output tensor shape and the decoded text of the first sequence.
  • gpu_metrics (GPURawMetrics | None): GPU utilization and memory samples collected during generation, or None if monitoring was disabled.

GPURawMetrics Fields

  • utilization (list[float]): GPU utilization percentage samples.
  • memory_used (list[float]): GPU memory usage in GB at each sample point.
  • timestamps (list[float]): Sample timestamps in seconds, relative to the first sample.
  • timestamp_0 (float): Absolute time of the first sample (time.time()).
  • monitoring_status (GPUMonitoringStatus): One of SUCCESS, FAILED, NO_GPUS_AVAILABLE, NO_SAMPLES_COLLECTED.
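The field list above maps naturally onto a dataclass. A sketch (field names taken from the table; the enum values are the four statuses listed, with illustrative sample data):

```python
import time
from dataclasses import dataclass
from enum import Enum

class GPUMonitoringStatus(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    NO_GPUS_AVAILABLE = "no_gpus_available"
    NO_SAMPLES_COLLECTED = "no_samples_collected"

@dataclass
class GPURawMetricsSketch:
    utilization: list[float]   # GPU utilization % per sample
    memory_used: list[float]   # memory in GB per sample
    timestamps: list[float]    # seconds relative to the first sample
    timestamp_0: float         # absolute time.time() of the first sample
    monitoring_status: GPUMonitoringStatus

metrics = GPURawMetricsSketch(
    utilization=[42.0, 57.5],
    memory_used=[10.1, 10.3],
    timestamps=[0.0, 0.05],
    timestamp_0=time.time(),
    monitoring_status=GPUMonitoringStatus.SUCCESS,
)
print(metrics.monitoring_status.name)  # SUCCESS
```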

Internal Behavior

Each measurement iteration within run_benchmark proceeds as follows:

  1. GPU monitor start: If config.gpu_monitoring is True and warmup is False, a GPUMonitor is created and start() is called, spawning a child process that begins sampling.
  2. Generation: For standard mode, a BenchmarkStreamer is created and passed to model.generate(**inputs, streamer=streamer). For continuous batching mode, model.generate_batch(inputs, allow_block_sharing=False, record_timestamps=True) is called.
  3. Wall-clock timing: time.perf_counter() is recorded before and after the generate call.
  4. GPU monitor stop: gpu_monitor.stop_and_collect() sends a stop signal to the child process, receives the collected metrics, and terminates the process.
  5. Timestamp extraction: For standard mode, the streamer's timestamps (excluding the first, which corresponds to input tokens) are collected. For continuous batching, timestamps are extracted from the output objects.
  6. Output validation: The number of generated tokens is compared to config.num_tokens_to_generate; a mismatch raises RuntimeError.
  7. Decoding: The first sequence's output tokens are decoded to text.
  8. Memory cleanup: flush_memory(flush_compile=False) clears GPU cache without resetting compile state.
  9. Accumulation: result.accumulate(e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics) adds the iteration's data to the BenchmarkResult.

Usage Examples

Basic Usage

import logging
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
from benchmark_v2.framework.benchmark_config import BenchmarkConfig

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger=logger)

config = BenchmarkConfig(
    warmup_iterations=5,
    measurement_iterations=20,
    gpu_monitoring=True,
    attn_implementation="flash_attention_2",
    batch_size=1,
    sequence_length=128,
    num_tokens_to_generate=128,
)

runner.setup_benchmark("meta-llama/Llama-3-8B", config)
result = runner.run_benchmark(config)

# result.e2e_latency is a list of 20 float values (seconds)
# result.time_to_first_token is a list of 20 float values (seconds)
# result.inter_token_latency is a list of 20 float values (seconds)
# result.gpu_metrics is a list of 20 GPURawMetrics objects

Accessing Per-Token Timestamps

# After running a benchmark
for i, latency in enumerate(result.e2e_latency):
    print(f"Iteration {i}: {latency:.3f}s")

# Compute throughput
throughput = result.get_throughput(total_generated_tokens=1 * 128)
print(f"Avg throughput: {sum(throughput)/len(throughput):.1f} tok/s")
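The per-token timestamps also let you derive time-to-first-token and inter-token latency directly. A sketch with synthetic values (real values come from the per-token timestamps returned by time_generate):

```python
# Synthetic per-token timestamps in seconds, relative to generation start.
token_timestamps = [0.030, 0.045, 0.061, 0.076, 0.092]

# Time to first token: the first per-token timestamp.
time_to_first_token = token_timestamps[0]

# Inter-token latency: gaps between consecutive token timestamps.
inter_token = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
mean_itl = sum(inter_token) / len(inter_token)

print(f"TTFT: {time_to_first_token * 1000:.1f} ms")  # TTFT: 30.0 ms
print(f"Mean ITL: {mean_itl * 1000:.1f} ms")         # Mean ITL: 15.5 ms
```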

Related Pages

Implements Principle
