Implementation: HuggingFace Transformers time_generate Measurement
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance, Profiling |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete tool for executing timed inference measurement iterations with per-token timestamping and concurrent GPU monitoring, provided by the HuggingFace Transformers benchmark framework.
Description
The measurement phase of BenchmarkRunner.run_benchmark calls time_generate(config, warmup=False) for config.measurement_iterations iterations, collecting end-to-end latency, per-token timestamps (via BenchmarkStreamer), decoded outputs, and GPU hardware metrics (via GPUMonitor) on each iteration. Results are accumulated into a BenchmarkResult object. The BenchmarkStreamer is a custom BaseStreamer subclass that records a time.perf_counter() timestamp in its put() method each time a token is generated. The GPUMonitor runs a separate process that samples GPU utilization and memory at configurable intervals (default: 50ms) using platform-specific backends (pynvml for NVIDIA, amdsmi for AMD, xpu-smi for Intel).
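The timestamping mechanism can be illustrated with a minimal, self-contained sketch (TimestampStreamer is a hypothetical stand-in; the real BenchmarkStreamer subclasses transformers' BaseStreamer and lives in benchmark_v2/framework/benchmark_runner.py):

```python
import time

class TimestampStreamer:
    """Illustrative stand-in for BenchmarkStreamer: records one
    time.perf_counter() timestamp per put() call."""

    def __init__(self):
        self.timestamps = []

    def put(self, value):
        # generate() calls put() once for the prompt tokens and then once
        # per generated token, so each call gets its own timestamp
        self.timestamps.append(time.perf_counter())

    def end(self):
        # called by generate() when generation finishes; nothing to do here
        pass
```

After generation, `streamer.timestamps[1:]` (dropping the first entry, which corresponds to the input tokens) gives per-token generation times.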
Usage
The measurement phase is executed automatically as part of run_benchmark after warmup completes. The number of measurement iterations is controlled by BenchmarkConfig.measurement_iterations. GPU monitoring is controlled by BenchmarkConfig.gpu_monitoring.
Code Reference
Source Location
- Repository: transformers
- Files:
  - benchmark_v2/framework/benchmark_runner.py (lines 112-136 for BenchmarkStreamer; lines 238-302 for the measurement loop and time_generate)
  - benchmark_v2/framework/hardware_metrics.py (lines 156-325 for GPUMonitor)
Signature
```python
class BenchmarkStreamer(BaseStreamer):
    def __init__(self, **kwargs) -> None:
        ...

    def put(self, value):
        ...

    def end(self):
        ...

class GPUMonitor:
    def __init__(self, sample_interval_sec: float = 0.05, logger: Logger | None = None):
        ...

    def start(self):
        ...

    def stop_and_collect(self) -> GPURawMetrics:
        ...

def time_generate(
    self, config: BenchmarkConfig, warmup: bool
) -> tuple[float, list[float], str, GPURawMetrics | None]:
    ...
```
Import
```python
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner, BenchmarkStreamer
from benchmark_v2.framework.hardware_metrics import GPUMonitor
```
I/O Contract
Inputs (time_generate in measurement mode)
| Name | Type | Required | Description |
|---|---|---|---|
| config | BenchmarkConfig | Yes | Benchmark configuration. Controls GPU monitoring, batching mode, and token generation count. |
| warmup | bool | Yes | Set to False for measurement. Enables GPU monitoring if config.gpu_monitoring is True. |
Outputs (time_generate)
| Name | Type | Description |
|---|---|---|
| e2e_latency | float | Wall-clock generation time in seconds, measured via time.perf_counter(). |
| timestamps | list[list[float]] | Per-batch-element lists of per-token timestamps (seconds relative to generation start). |
| shape_and_decoded_output | str | String containing the output tensor shape and the decoded text of the first sequence. |
| gpu_metrics | GPURawMetrics \| None | GPU utilization and memory samples collected during generation, or None if monitoring was disabled. |
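As an illustration of what these outputs support downstream, time-to-first-token and inter-token latencies follow from the per-token timestamps by simple differencing (latency_breakdown is a hypothetical helper for this sketch, not framework API):

```python
def latency_breakdown(start, token_timestamps):
    """Derive TTFT and inter-token latencies from per-token timestamps.

    start: perf_counter() value taken just before the generate call
    token_timestamps: one timestamp per generated token, in seconds
    """
    ttft = token_timestamps[0] - start
    itl = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return ttft, itl

# synthetic timestamps, seconds relative to start=0.0
ttft, itl = latency_breakdown(0.0, [0.05, 0.07, 0.09, 0.11])
# ttft is 0.05 s; itl is three gaps of about 0.02 s (up to float rounding)
```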
GPURawMetrics Fields
| Name | Type | Description |
|---|---|---|
| utilization | list[float] | GPU utilization percentage samples. |
| memory_used | list[float] | GPU memory usage in GB at each sample point. |
| timestamps | list[float] | Sample timestamps in seconds, relative to the first sample. |
| timestamp_0 | float | Absolute time of the first sample (time.time()). |
| monitoring_status | GPUMonitoringStatus | One of: SUCCESS, FAILED, NO_GPUS_AVAILABLE, NO_SAMPLES_COLLECTED. |
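The sampling loop that produces these fields can be sketched with a background thread and a pluggable sampling callback. This is illustrative only: the real GPUMonitor uses a separate process and platform backends (pynvml/amdsmi/xpu-smi), while SimpleGPUMonitor and sample_fn here are stand-ins:

```python
import threading
import time

class SimpleGPUMonitor:
    """Illustrative monitor. sample_fn is a stand-in callback returning
    (utilization_pct, memory_used_gb); a real backend would query the GPU."""

    def __init__(self, sample_fn, sample_interval_sec=0.05):
        self.sample_fn = sample_fn
        self.interval = sample_interval_sec
        self._stop = threading.Event()
        self.utilization, self.memory_used, self.timestamps = [], [], []
        self.timestamp_0 = None
        self._thread = None

    def _loop(self):
        self.timestamp_0 = time.time()
        while not self._stop.is_set():
            util, mem = self.sample_fn()
            self.utilization.append(util)
            self.memory_used.append(mem)
            # store sample times relative to the first sample
            self.timestamps.append(time.time() - self.timestamp_0)
            self._stop.wait(self.interval)

    def start(self):
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def stop_and_collect(self):
        self._stop.set()
        self._thread.join()
        return {
            "utilization": self.utilization,
            "memory_used": self.memory_used,
            "timestamps": self.timestamps,
            "timestamp_0": self.timestamp_0,
        }
```

A thread suffices for the sketch; the framework uses a child process so that sampling is not blocked by the GIL while generation runs.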
Internal Behavior
Each measurement iteration within run_benchmark proceeds as follows:
- GPU monitor start: If
config.gpu_monitoringisTrueandwarmupisFalse, aGPUMonitoris created andstart()is called, spawning a child process that begins sampling. - Generation: For standard mode, a
BenchmarkStreameris created and passed tomodel.generate(**inputs, streamer=streamer). For continuous batching mode,model.generate_batch(inputs, allow_block_sharing=False, record_timestamps=True)is called. - Wall-clock timing:
time.perf_counter()is recorded before and after the generate call. - GPU monitor stop:
gpu_monitor.stop_and_collect()sends a stop signal to the child process, receives the collected metrics, and terminates the process. - Timestamp extraction: For standard mode, the streamer's timestamps (excluding the first, which corresponds to input tokens) are collected. For continuous batching, timestamps are extracted from the output objects.
- Output validation: The number of generated tokens is compared to
config.num_tokens_to_generate; a mismatch raisesRuntimeError. - Decoding: The first sequence's output tokens are decoded to text.
- Memory cleanup:
flush_memory(flush_compile=False)clears GPU cache without resetting compile state. - Accumulation:
result.accumulate(e2e_latency, timestamps, shape_and_decoded_output, gpu_metrics)adds the iteration's data to theBenchmarkResult.
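The steps above can be condensed into a sketch of one iteration (standard mode only; timed_generate here is a simplified stand-in that takes a generate callable and a streamer rather than a model, and monitoring/validation details are omitted):

```python
import time

def timed_generate(generate_fn, streamer, gpu_monitor=None):
    """Sketch of one measurement iteration.

    generate_fn: zero-arg callable wrapping model.generate(..., streamer=streamer)
    streamer: object that collects per-token timestamps in .timestamps
    gpu_monitor: optional object exposing start() / stop_and_collect()
    """
    if gpu_monitor is not None:
        gpu_monitor.start()
    start = time.perf_counter()
    outputs = generate_fn()
    e2e_latency = time.perf_counter() - start
    gpu_metrics = gpu_monitor.stop_and_collect() if gpu_monitor else None
    # drop the first timestamp: it corresponds to the input (prompt) tokens
    token_timestamps = streamer.timestamps[1:]
    return e2e_latency, token_timestamps, outputs, gpu_metrics
```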
Usage Examples
Basic Usage
```python
import logging
from benchmark_v2.framework.benchmark_runner import BenchmarkRunner
from benchmark_v2.framework.benchmark_config import BenchmarkConfig

logger = logging.getLogger("benchmark")
runner = BenchmarkRunner(logger=logger)
config = BenchmarkConfig(
    warmup_iterations=5,
    measurement_iterations=20,
    gpu_monitoring=True,
    attn_implementation="flash_attention_2",
    batch_size=1,
    sequence_length=128,
    num_tokens_to_generate=128,
)
runner.setup_benchmark("meta-llama/Llama-3-8B", config)
result = runner.run_benchmark(config)

# result.e2e_latency is a list of 20 float values (seconds)
# result.time_to_first_token is a list of 20 float values (seconds)
# result.inter_token_latency is a list of 20 float values (seconds)
# result.gpu_metrics is a list of 20 GPURawMetrics objects
```
Accessing Per-Token Timestamps
```python
# After running a benchmark
for i, latency in enumerate(result.e2e_latency):
    print(f"Iteration {i}: {latency:.3f}s")

# Compute throughput (batch_size * num_tokens_to_generate = 1 * 128)
throughput = result.get_throughput(total_generated_tokens=1 * 128)
print(f"Avg throughput: {sum(throughput) / len(throughput):.1f} tok/s")
```
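Per-iteration lists such as result.e2e_latency are plain Python lists, so the standard library suffices for summary statistics (the values below are synthetic, standing in for a real run's latencies):

```python
import statistics

# synthetic per-iteration e2e latencies in seconds
latencies = [0.52, 0.50, 0.55, 0.51]

mean = statistics.mean(latencies)
stdev = statistics.stdev(latencies)
# crude p95 via nearest-rank on the sorted list
p95 = sorted(latencies)[int(0.95 * (len(latencies) - 1))]
print(f"mean={mean:.3f}s stdev={stdev:.3f}s p95={p95:.3f}s")
```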