Workflow:Huggingface Transformers Model Benchmarking

Knowledge Sources	Huggingface Transformers, Transformers Benchmarks
Domains	LLMs, Benchmarking, Performance, Inference
Last Updated	2026-02-13 20:00 GMT

Overview

End-to-end process for systematically measuring and comparing inference performance of Transformer models across different configurations, attention implementations, and optimization techniques.

Description

This workflow covers the v2 benchmark framework for measuring model inference performance. The framework supports benchmarking across multiple attention implementations (Flash Attention 2, SDPA, Flex Attention, Eager), compilation modes (torch.compile with various optimization levels), kernel optimizations, and continuous batching. It measures end-to-end latency, time-to-first-token (TTFT), inter-token latency (ITL), and throughput, with optional GPU utilization monitoring. Results can be saved locally and uploaded to Hugging Face Hub datasets so that performance can be tracked across commits.

Usage

Execute this workflow when you need to evaluate the inference performance of a model under different configurations, compare attention backend performance, measure the impact of torch.compile optimizations, or establish performance baselines for regression tracking across code changes.

Execution Steps

Step 1: Configure Benchmark Parameters

Define the benchmark configuration specifying input dimensions (batch size, sequence length, tokens to generate), the number of warmup and measurement iterations, and which optimization techniques to test. Configuration can be specified via CLI arguments or loaded from a JSON/JSONL config file.

Key considerations:

batch_size, sequence_length, and num_tokens_to_generate define the input shape
warmup_iterations (default 5) stabilizes GPU state before measurement
measurement_iterations (default 20) sets the number of timed runs, and therefore the sample size behind each reported statistic
Coverage levels (0-4) control how many configuration combinations are tested
Multiple values for each parameter create a cross-product of configurations
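
The parameters above can be pictured as a small configuration object. The following is a minimal sketch, not the framework's own class: `BenchmarkParams` and `params_from_json` are hypothetical names, and every default other than the warmup and measurement counts stated above is an assumption.

```python
import json
from dataclasses import dataclass, field


@dataclass
class BenchmarkParams:
    """Hypothetical container for the parameters listed above."""

    # List-valued fields: several values per field produce a cross-product later.
    batch_size: list = field(default_factory=lambda: [1])               # assumed default
    sequence_length: list = field(default_factory=lambda: [128])        # assumed default
    num_tokens_to_generate: list = field(default_factory=lambda: [128])  # assumed default
    warmup_iterations: int = 5         # default stated in the text
    measurement_iterations: int = 20   # default stated in the text
    coverage_level: int = 1            # range 0-4 per the text; default is assumed


def params_from_json(text: str) -> BenchmarkParams:
    """Build parameters from a JSON object; unknown keys raise TypeError."""
    params = BenchmarkParams(**json.loads(text))
    if not 0 <= params.coverage_level <= 4:
        raise ValueError("coverage_level must be between 0 and 4")
    return params


params = params_from_json('{"batch_size": [1, 4], "coverage_level": 2}')
```

Keys missing from the JSON fall back to the defaults, so a config file only needs to list what differs from the baseline.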

Step 2: Generate Configuration Matrix

Expand the benchmark parameters into a matrix of individual test configurations. Each configuration combines a specific attention implementation, compile mode, and input shape. Invalid combinations (e.g., Flash Attention with certain compile modes) are automatically filtered out.

Key considerations:

Coverage level 0: Minimal (flex_attention without compile)
Coverage level 1: Standard (adds flash_attention_2, eager+compile, continuous batching)
Coverage levels 2-4: Progressively more comprehensive combinations
Each combination is validated, and invalid ones are skipped rather than benchmarked
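
The expansion step amounts to a filtered cross-product. The sketch below illustrates the idea under stated assumptions: `expand_matrix` and `is_valid` are hypothetical helpers, and the validity rule shown (Flash Attention 2 rejected whenever any compile mode is set) is a stand-in, since the text does not say which combinations the framework actually rejects.

```python
from itertools import product


def is_valid(attn_implementation, compile_mode):
    """Illustrative rule only; the real framework's rules are not given in the text."""
    return not (attn_implementation == "flash_attention_2" and compile_mode is not None)


def expand_matrix(batch_sizes, sequence_lengths, attn_implementations, compile_modes):
    """Return one dict per valid (shape, attention, compile mode) combination."""
    configs = []
    for bs, sl, attn, mode in product(
        batch_sizes, sequence_lengths, attn_implementations, compile_modes
    ):
        if not is_valid(attn, mode):
            continue  # skip instead of failing, as described above
        configs.append(
            {
                "batch_size": bs,
                "sequence_length": sl,
                "attn_implementation": attn,
                "compile_mode": mode,  # None means "no torch.compile"
            }
        )
    return configs


matrix = expand_matrix(
    batch_sizes=[1, 2],
    sequence_lengths=[128],
    attn_implementations=["flash_attention_2", "eager"],
    compile_modes=[None, "default"],
)
```

A coverage level would then simply select which lists of attention implementations and compile modes are passed in.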

Step 3: Model and Tokenizer Loading

For each benchmark configuration, load the model and tokenizer with the specified settings. This includes applying the attention implementation, compile configuration, and precision settings. Prepare standardized input prompts for consistent benchmarking.

Key considerations:

Model is loaded with AutoModelForCausalLM.from_pretrained()
Attention implementation is set via attn_implementation parameter
torch.compile is applied when specified in the configuration
A fixed default prompt ensures consistent input across configurations
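
A loading step along these lines might look as follows. This is a sketch rather than the framework's code: `load_for_benchmark`, `prepare_inputs`, and `DEFAULT_PROMPT` are hypothetical names, the prompt text is a placeholder, and precision settings are left out. The heavy imports are deferred into the function so the module can be imported without a GPU stack.

```python
# Placeholder text; the framework's actual default prompt is not given in the source.
DEFAULT_PROMPT = "The quick brown fox jumps over the lazy dog."


def load_for_benchmark(model_id, attn_implementation="sdpa", compile_mode=None):
    """Load model and tokenizer for one benchmark configuration."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        attn_implementation=attn_implementation,  # e.g. "eager", "sdpa", "flash_attention_2"
    )
    model.eval()
    if compile_mode is not None:
        # Compile only the forward pass; generate() keeps calling model.forward.
        model.forward = torch.compile(model.forward, mode=compile_mode)
    return model, tokenizer


def prepare_inputs(tokenizer, batch_size, sequence_length, prompt=DEFAULT_PROMPT):
    """Repeat one fixed prompt so every configuration sees identical input."""
    # Identical prompts tokenize to the same length, so no padding is needed.
    return tokenizer(
        [prompt] * batch_size,
        return_tensors="pt",
        truncation=True,
        max_length=sequence_length,
    )
```

Because the prompt is repeated rather than sampled, differences between runs can be attributed to the configuration and not to the input.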

Step 4: Warmup Phase

Execute warmup iterations to stabilize GPU state, fill caches, and trigger any lazy compilation. Warmup results are discarded and not included in measurements.

Key considerations:

Warmup eliminates first-run effects (kernel compilation, memory allocation)
GPU cache is cleared between benchmark configurations
torch.compile cache is flushed for accurate per-configuration measurement
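
The warmup pattern can be sketched independently of any model: run the workload a few times without recording anything, then start timing. `run_with_warmup` below is a hypothetical helper, and `fn` stands in for a single generation call.

```python
import time


def run_with_warmup(fn, warmup_iterations=5, measurement_iterations=20):
    """Call fn repeatedly; return wall-clock latencies of the measured runs only."""
    for _ in range(warmup_iterations):
        fn()  # results and timings are deliberately discarded

    latencies = []
    for _ in range(measurement_iterations):
        start = time.perf_counter()
        fn()
        # Note: on a GPU, the device must be synchronized before this read,
        # because CUDA kernels are launched asynchronously.
        latencies.append(time.perf_counter() - start)
    return latencies


latencies = run_with_warmup(lambda: sum(range(10_000)), warmup_iterations=2, measurement_iterations=3)
```

In PyTorch, `torch.cuda.synchronize()`, `torch.cuda.empty_cache()`, and `torch._dynamo.reset()` are the calls that most closely match the synchronization and cache clearing described here; whether the framework uses exactly these is an assumption.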

Step 5: Measurement Phase

Execute the measurement iterations, collecting detailed timing data for each run. A streaming callback captures per-token timestamps to compute fine-grained latency metrics. GPU utilization and memory metrics are optionally collected in a background monitoring thread.

Key considerations:

Each iteration measures end-to-end latency via wall-clock timing
BenchmarkStreamer captures individual token generation timestamps
Time-to-first-token (TTFT) measures initial latency
Inter-token latency (ITL) measures average time between consecutive tokens
GPU metrics (utilization, memory) are sampled by a background monitor
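
To make the TTFT and ITL definitions concrete, the sketch below derives both from a list of per-token timestamps. `TokenTimestampRecorder` is a hypothetical stand-in for the BenchmarkStreamer, whose real interface is not shown in the source, and `on_token` is an assumed method name; wiring it into a generation loop is left out.

```python
import time


class TokenTimestampRecorder:
    """Hypothetical recorder: stores one timestamp per generated token."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the logic can be tested deterministically
        self.timestamps = []

    def on_token(self):
        self.timestamps.append(self._clock())


def latency_metrics(start, timestamps):
    """Compute TTFT, mean ITL, and end-to-end latency in seconds."""
    if not timestamps:
        raise ValueError("no tokens were recorded")
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return {
        "ttft": timestamps[0] - start,                   # wait before the first token
        "itl": sum(gaps) / len(gaps) if gaps else 0.0,   # mean gap between consecutive tokens
        "e2e": timestamps[-1] - start,                   # start until the last token
    }


# Deterministic example with a fake clock instead of real generation.
recorder = TokenTimestampRecorder(clock=iter([0.5, 0.75, 1.0, 1.25]).__next__)
for _ in range(4):
    recorder.on_token()
metrics = latency_metrics(start=0.0, timestamps=recorder.timestamps)
```

Separating the recorder from the arithmetic keeps the metric definitions testable without a model or a GPU.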

Step 6: Statistical Aggregation

Compute summary statistics across all measurement iterations for each metric. Calculate mean, standard deviation, min, median, max, and 95th percentile for end-to-end latency, TTFT, ITL, and throughput.

Key considerations:

Statistics are computed per-metric across all measurement iterations
P95 latency is particularly useful for understanding tail latency
Throughput is calculated as total tokens generated divided by end-to-end time
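
The aggregation can be reproduced with the standard library alone. This is a sketch under two assumptions that the source does not settle: the standard deviation is the sample (n - 1) form, and the 95th percentile uses linear interpolation between neighbouring values, which matches NumPy's default method. `summarize`, `percentile`, and `throughput` are hypothetical helper names.

```python
import math
import statistics


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarize(values):
    """Summary statistics for one metric across all measurement iterations."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,  # sample std (assumed)
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
        "p95": percentile(values, 95),
    }


def throughput(total_tokens, e2e_seconds):
    """Tokens per second: total generated tokens divided by end-to-end time."""
    return total_tokens / e2e_seconds


e2e_summary = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
tokens_per_second = throughput(total_tokens=100, e2e_seconds=2.0)
```

With only 20 measurement iterations, the 95th percentile sits between the two slowest runs, so it is sensitive to a single outlier.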

Step 7: Results Serialization and Reporting

Save results to JSON files organized by model name and timestamp. Optionally upload results to a Hugging Face Hub dataset for historical tracking. Display a summary table comparing all configurations.

Key considerations:

Results include full configuration, raw measurements, and computed statistics
Hugging Face Hub upload supports both full and summarized result formats
Results are tagged with git commit ID for regression tracking
JSON output enables custom analysis and visualization
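
A serialization step of this kind might be sketched as follows. The directory layout, file naming, and JSON field names are assumptions chosen for illustration, and `save_results` and `current_commit` are hypothetical helpers; only the ingredients (configuration, raw measurements, statistics, commit ID, model name, timestamp) come from the description above.

```python
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def current_commit():
    """Return the current git commit hash, or None outside a git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None


def save_results(output_dir, model_name, config, measurements, stats, commit_id=None):
    """Write one JSON file under <output_dir>/<model>/ and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    safe_name = model_name.replace("/", "_")  # "org/model" is not a valid file name
    path = Path(output_dir) / safe_name / f"{safe_name}_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "model": model_name,
        "timestamp": stamp,
        "commit_id": commit_id if commit_id is not None else current_commit(),
        "config": config,
        "measurements": measurements,
        "statistics": stats,
    }
    path.write_text(json.dumps(payload, indent=2))
    return path
```

For the optional upload, `HfApi.upload_file` from the `huggingface_hub` package with `repo_type="dataset"` is one way to push such a file to a Hub dataset; whether the framework uses that call is an assumption.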
