Implementation: HuggingFace Transformers time_generate Measurement
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance, Profiling |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for executing timed inference measurement iterations with per-token timestamping and concurrent GPU monitoring, provided by the HuggingFace Transformers benchmark framework.
Description
The measurement phase of BenchmarkRunner.run_benchmark calls time_generate(config, warmup=False) for config.measurement_iterations iterations, collecting end-to-end latency, per-token timestamps (via BenchmarkStreamer), decoded outputs, and GPU hardware metrics (via GPUMonitor) on each iteration. Results are accumulated into a BenchmarkResult object. The BenchmarkStreamer is a custom BaseStreamer subclass that records a time.perf_counter() timestamp in its put() method each time a token is generated. The GPUMonitor runs a separate process that samples GPU utilization and memory at configurable intervals (default: 50ms) using platform-specific backends (pynvml for NVIDIA, amdsmi for AMD, xpu-smi for Intel).
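The timestamping mechanism can be illustrated with a minimal, self-contained sketch (TimestampStreamer is a hypothetical stand-in; the real BenchmarkStreamer subclasses transformers' BaseStreamer and lives in benchmark_v2/framework/benchmark_runner.py):

```python
import time

class TimestampStreamer:
    """Illustrative stand-in for BenchmarkStreamer: records one
    time.perf_counter() timestamp per put() call."""

    def __init__(self):
        self.timestamps = []

    def put(self, value):
        # generate() calls put() once for the prompt tokens and then once
        # per generated token, so each call gets its own timestamp
        self.timestamps.append(time.perf_counter())

    def end(self):
        # called by generate() when generation finishes; nothing to do here
        pass
```

After generation, `streamer.timestamps[1:]` (dropping the first entry, which corresponds to the input tokens) gives per-token generation times.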
Usage
The measurement phase is executed automatically as part of run_benchmark after warmup completes. The number of measurement iterations is controlled by BenchmarkConfig.measurement_iterations. GPU monitoring is controlled by BenchmarkConfig.gpu_monitoring.
Code Reference
Source Location
- Repository: transformers
- Files:
  - benchmark_v2/framework/benchmark_runner.py (lines 112-136 for BenchmarkStreamer; lines 238-302 for the measurement loop and time_generate)
  - benchmark_v2/framework/hardware_metrics.py (lines 156-325 for GPUMonitor)
Signature
```python
class BenchmarkStreamer(BaseStreamer):
    def __init__(self, **kwargs) -> None:
        ...

    def put(self, value):
        ...

    def end(self):
        ...

class GPUMonitor:
    def __init__(self, sample_interval_sec: float = 0.05, logger: Logger | None = None):
        ...

    def start(self):
        ...

    def stop_and_collect(self) -> GPURawMetrics:
        ...

def time_generate(
    self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[float], str, GPURawMetrics | None]:
    ...
```
Import
```python
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner, BenchmarkStreamer
from benchmark_v2.framework.hardware_metrics import GPUMonitor
```
I/O Contract
Inputs (time_generate in measurement mode)
| Name | Type | Required | Description |
|---|---|---|---|
| config | BenchmarkConfig | Yes | Benchmark configuration. Controls GPU monitoring, batching mode, and token generation count. |
| warmup | bool | Yes | Set to False for measurement. Enables GPU monitoring if config.gpu_monitoring is True. |
Outputs (time_generate)
| Name | Type | Description |
|---|---|---|
| e2e_latency | float | Wall-clock generation time in seconds, measured via time.perf_counter(). |
| timestamps | list[list[float]] | Per-batch-element lists of per-token timestamps (seconds relative to generation start). |
| shape_and_decoded_output | str | String containing the output tensor shape and the decoded text of the first sequence. |
| gpu_metrics | GPURawMetrics \| None | GPU utilization and memory samples collected during generation, or None if monitoring was disabled. |
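As an illustration of what these outputs support downstream, time-to-first-token and inter-token latencies follow from the per-token timestamps by simple differencing (latency_breakdown is a hypothetical helper for this sketch, not framework API):

```python
def latency_breakdown(start, token_timestamps):
    """Derive TTFT and inter-token latencies from per-token timestamps.

    start: perf_counter() value taken just before the generate call
    token_timestamps: one timestamp per generated token, in seconds
    """
    ttft = token_timestamps[0] - start
    itl = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return ttft, itl

# synthetic timestamps, seconds relative to start=0.0
ttft, itl = latency_breakdown(0.0, [0.05, 0.07, 0.09, 0.11])
# ttft is 0.05 s; itl is three gaps of about 0.02 s (up to float rounding)
```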
GPURawMetrics Fields
| Name | Type | Description |
|---|---|---|
| utilization | list[float] | GPU utilization percentage samples. |
| memory_used | list[float] | GPU memory usage in GB at each sample point. |
| timestamps | list[float] | Sample timestamps in seconds, relative to the first sample. |
| timestamp_0 | float | Absolute time of the first sample (time.time()). |
| monitoring_status | GPUMonitoringStatus | One of: SUCCESS, FAILED, NO_GPUS_AVAILABLE, NO_SAMPLES_COLLECTED. |
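The sampling loop that produces these fields can be sketched with a background thread and a pluggable sampling callback. This is illustrative only: the real GPUMonitor uses a separate process and platform backends (pynvml/amdsmi/xpu-smi), while SimpleGPUMonitor and sample_fn here are stand-ins:

```python
import threading
import time

class SimpleGPUMonitor:
    """Illustrative monitor. sample_fn is a stand-in callback returning
    (utilization_pct, memory_used_gb); a real backend would query the GPU."""

    def __init__(self, sample_fn, sample_interval_sec=0.05):
        self.sample_fn = sample_fn
        self.interval = sample_interval_sec
        self._stop = threading.Event()
        self.utilization, self.memory_used, self.timestamps = [], [], []
        self.timestamp_0 = None
        self._thread = None

    def _loop(self):
        self.timestamp_0 = time.time()
        while not self._stop.is_set():
            util, mem = self.sample_fn()
            self.utilization.append(util)
            self.memory_used.append(mem)
            # store sample times relative to the first sample
            self.timestamps.append(time.time() - self.timestamp_0)
            self._stop.wait(self.interval)

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop_and_collect(self):
        self._stop.set()
        self._thread.join()
        return {
            "utilization": self.utilization,
            "memory_used": self.memory_used,
            "timestamps": self.timestamps,
            "timestamp_0": self.timestamp_0,
        }
```

A thread suffices for the sketch; the framework uses a child process so that sampling is not blocked by the GIL while generation runs.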
Internal Behavior
Each measurement iteration within run_benchmark proceeds as follows:
- GPU monitor start: If
config.gpu_monitoringisTrueandwarmupisFalse, aGPUMonitoris created andstart()is called, spawning a child process that begins sampling. - Generation: For standard mode, a
BenchmarkStreameris created and passed tomodel.generate(**inputs, streamer=streamer). For continuous batching mode,model.generate_batch(inputs, allow_block_sharing=False, record_timestamps=True)is called. - Wall-clock timing:
time.perf_counter()is recorded before and after the generate call. - GPU monitor stop:
gpu_monitor.stop_and_collect()sends a stop signal to the child process, receives the collected metrics, and terminates the process. - Timestamp extraction: For standard mode, the streamer's timestamps (excluding the first, which corresponds to input tokens) are collected. For continuous batching, timestamps are extracted from the output objects.
- Output validation: The number of generated tokens is compared to
config.num_tokens_to_generate; a mismatch raisesRuntimeError. - Decoding: The first sequence's output tokens are decoded to text.
- Memory cleanup:
flush_memory(flush_compile=False)clears GPU cache without resetting compile state. - Accumulation:
result.accumulate(e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)adds the iteration's data to theBenchmarkResult.
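The steps above can be condensed into a sketch of one iteration (standard mode only; timed_generate here is a simplified stand-in that takes a generate callable and a streamer rather than a model, and monitoring/validation details are omitted):

```python
import time

def timed_generate(generate_fn, streamer, gpu_monitor=None):
    """Sketch of one measurement iteration.

    generate_fn: zero-arg callable wrapping model.generate(..., streamer=streamer)
    streamer: object that collects per-token timestamps in .timestamps
    gpu_monitor: optional object exposing start() / stop_and_collect()
    """
    if gpu_monitor is not None:
        gpu_monitor.start()
    start = time.perf_counter()
    outputs = generate_fn()
    e2e_latency = time.perf_counter() - start
    gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor else None
    # drop the first timestamp: it corresponds to the input (prompt) tokens
    token_timestamps = streamer.timestamps[1:]
    return e2e_latency, token_timestamps, outputs, gpu_metrics
```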
Usage Examples
Basic Usage
```python
import logging
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
from benchmark_v2.framework.benchmark_config import BenchmarkConfig

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger=logger)
config = BenchmarkConfig(
    warmup_iterations=5,
    measurement_iterations=20,
    gpu_monitoring=True,
    attn_implementation="flash_attention_2",
    batch_size=1,
    sequence_length=128,
    num_tokens_to_generate=128,
)
runner.setup_benchmark("meta-llama/Llama-3-8B", config)
result = runner.run_benchmark(config)

# result.e2e_latency is a list of 20 float values (seconds)
# result.time_to_first_token is a list of 20 float values (seconds)
# result.inter_token_latency is a list of 20 float values (seconds)
# result.gpu_metrics is a list of 20 GPURawMetrics objects
```
Accessing Per-Token Timestamps
```python
# After running a benchmark
for i, latency in enumerate(result.e2e_latency):
    print(f"Iteration {i}: {latency:.3f}s")

# Compute throughput (batch_size * num_tokens_to_generate = 1 * 128)
throughput = result.get_throughput(total_generated_tokens=1 * 128)
print(f"Avg throughput: {sum(throughput) / len(throughput):.1f} tok/s")
```
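Per-iteration lists such as result.e2e_latency are plain Python lists, so the standard library suffices for summary statistics (the values below are synthetic, standing in for a real run's latencies):

```python
import statistics

# synthetic per-iteration e2e latencies in seconds
latencies = [0.52, 0.50, 0.55, 0.51]

mean = statistics.mean(latencies)
stdev = statistics.stdev(latencies)
# crude p95 via nearest-rank on the sorted list
p95 = sorted(latencies)[int(0.95 * (len(latencies) - 1))]
print(f"mean={mean:.3f}s stdev={stdev:.3f}s p95={p95:.3f}s")
```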