Workflow:Huggingface Transformers Model Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Benchmarking, Performance, Inference |
| Last Updated | 2026-02-13 20:00 GMT |
Overview
End-to-end process for systematically measuring and comparing inference performance of Transformer models across different configurations, attention implementations, and optimization techniques.
Description
This workflow covers the v2 benchmark framework for measuring model inference performance. The framework supports benchmarking across multiple attention implementations (Flash Attention 2, SDPA, Flex Attention, Eager), compilation modes (torch.compile with various optimization levels), kernel optimizations, and continuous batching. It measures end-to-end latency, time-to-first-token (TTFT), inter-token latency (ITL), and throughput, with optional GPU utilization monitoring. Results can be saved locally and uploaded to Hugging Face Hub datasets for tracking across commits.
Usage
Execute this workflow when you need to evaluate the inference performance of a model under different configurations, compare attention backend performance, measure the impact of torch.compile optimizations, or establish performance baselines for regression tracking across code changes.
Execution Steps
Step 1: Configure Benchmark Parameters
Define the benchmark configuration specifying input dimensions (batch size, sequence length, tokens to generate), the number of warmup and measurement iterations, and which optimization techniques to test. Configuration can be specified via CLI arguments or loaded from a JSON/JSONL config file.
Key considerations:
- batch_size, sequence_length, and num_tokens_to_generate define the input shape
- warmup_iterations (default 5) sets how many discarded runs stabilize GPU state before measurement
- measurement_iterations (default 20) sets the sample size; more iterations give more reliable statistics
- Coverage levels (0-4) control how many configuration combinations are tested
- Multiple values for each parameter create a cross-product of configurations
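The cross-product behaviour can be illustrated with a minimal sketch. `ShapeConfig` and `expand_shapes` are hypothetical names chosen for this example and are not the framework's own classes; only the field names and the two iteration defaults come from the text above.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ShapeConfig:
    """Hypothetical container for one input shape plus iteration counts."""
    batch_size: int
    sequence_length: int
    num_tokens_to_generate: int
    warmup_iterations: int = 5        # default stated in the text
    measurement_iterations: int = 20  # default stated in the text


def expand_shapes(batch_sizes, sequence_lengths, tokens_to_generate):
    """Return one ShapeConfig per element of the cross-product."""
    return [
        ShapeConfig(b, s, n)
        for b, s, n in product(batch_sizes, sequence_lengths, tokens_to_generate)
    ]


# Two batch sizes x two sequence lengths x one generation length
configs = expand_shapes([1, 4], [128, 512], [64])
print(len(configs))  # 2 * 2 * 1 = 4 configurations
```

Because every additional value multiplies the total, it is worth checking the resulting count before starting a long run.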
Step 2: Generate Configuration Matrix
Expand the benchmark parameters into a matrix of individual test configurations. Each configuration combines a specific attention implementation, compile mode, and input shape. Invalid combinations (e.g., Flash Attention with certain compile modes) are automatically filtered out.
Key considerations:
- Coverage level 0: Minimal (flex_attention without compile)
- Coverage level 1: Standard (adds flash_attention_2, eager+compile, continuous batching)
- Coverage levels 2-4: Progressively more comprehensive combinations
- Invalid combinations are detected during validation and skipped automatically
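The expansion and filtering can be sketched as follows. The validity rule in `is_valid` is a placeholder assumption made for illustration (it rejects flash_attention_2 combined with any compile mode); the framework's real rules and its coverage-level tables are not reproduced here, and the function names are hypothetical.

```python
from itertools import product

ATTN_IMPLEMENTATIONS = ["eager", "sdpa", "flash_attention_2", "flex_attention"]
COMPILE_MODES = [None, "default", "max-autotune"]  # None means no torch.compile


def is_valid(attn, compile_mode):
    """Placeholder rule (assumption): flash_attention_2 is not combined with compile."""
    return not (attn == "flash_attention_2" and compile_mode is not None)


def build_matrix(attn_impls, compile_modes, shapes):
    """Combine every attention backend, compile mode, and shape; drop invalid ones."""
    matrix = []
    for attn, mode, shape in product(attn_impls, compile_modes, shapes):
        if is_valid(attn, mode):
            matrix.append(
                {"attn_implementation": attn, "compile_mode": mode, "shape": shape}
            )
    return matrix


shapes = [(1, 128, 64)]  # (batch_size, sequence_length, num_tokens_to_generate)
matrix = build_matrix(ATTN_IMPLEMENTATIONS, COMPILE_MODES, shapes)
print(len(matrix))  # 12 combinations minus 2 rejected ones = 10
```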
Step 3: Model and Tokenizer Loading
For each benchmark configuration, load the model and tokenizer with the specified settings. This includes applying the attention implementation, compile configuration, and precision settings. Prepare standardized input prompts for consistent benchmarking.
Key considerations:
- Model is loaded with AutoModelForCausalLM.from_pretrained()
- Attention implementation is set via attn_implementation parameter
- torch.compile is applied when specified in the configuration
- A fixed default prompt ensures consistent input across configurations
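A loading step along these lines might look like the sketch below. `AutoModelForCausalLM.from_pretrained()` and its `attn_implementation` parameter come from the text; `build_load_kwargs`, `load_for_benchmark`, and `DEFAULT_PROMPT` are hypothetical names, the prompt string is a stand-in, and compiling `model.forward` is only one possible way to apply torch.compile. Precision settings are left out because the text does not say how they are passed.

```python
SUPPORTED_ATTN = {"eager", "sdpa", "flash_attention_2", "flex_attention"}
DEFAULT_PROMPT = "Why is the sky blue?"  # hypothetical stand-in for the fixed prompt


def build_load_kwargs(attn_implementation):
    """Build keyword arguments for from_pretrained(); pure Python, no model needed."""
    if attn_implementation not in SUPPORTED_ATTN:
        raise ValueError(f"unknown attention implementation: {attn_implementation}")
    return {"attn_implementation": attn_implementation}


def load_for_benchmark(model_id, attn_implementation="sdpa", compile_mode=None):
    """Load model, tokenizer, and inputs. Requires torch and transformers."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, **build_load_kwargs(attn_implementation)
    )
    model.eval()
    if compile_mode is not None:
        # Assumption: compile the forward pass; the framework may compile differently.
        model.forward = torch.compile(model.forward, mode=compile_mode)
    # The same prompt is tokenized for every configuration.
    inputs = tokenizer(DEFAULT_PROMPT, return_tensors="pt")
    return model, tokenizer, inputs


kwargs = build_load_kwargs("flex_attention")
print(kwargs)  # {'attn_implementation': 'flex_attention'}
```

Keeping the keyword construction separate from the actual load makes the configuration logic checkable without downloading a model.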
Step 4: Warmup Phase
Execute warmup iterations to stabilize GPU state, fill caches, and trigger any lazy compilation. Warmup results are discarded, so they never affect the reported statistics.
Key considerations:
- Warmup eliminates first-run effects (kernel compilation, memory allocation)
- GPU cache is cleared between benchmark configurations
- torch.compile cache is flushed for accurate per-configuration measurement
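The warmup and cleanup steps can be sketched as below. `run_warmup` and `reset_between_configs` are hypothetical helpers; `torch.cuda.empty_cache()` and `torch._dynamo.reset()` are real PyTorch calls, but whether the framework uses exactly these calls is an assumption.

```python
def run_warmup(generate_fn, warmup_iterations=5):
    """Call generate_fn repeatedly and throw the results away."""
    for _ in range(warmup_iterations):
        generate_fn()  # result intentionally discarded


def reset_between_configs():
    """Best-effort cleanup between configurations; a no-op without PyTorch 2.x."""
    try:
        import torch
        import torch._dynamo
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached GPU memory blocks
    torch._dynamo.reset()         # drop torch.compile caches
    return True


# Usage with a stand-in for model.generate()
calls = []
run_warmup(lambda: calls.append("generated"), warmup_iterations=5)
print(len(calls))  # 5
```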
Step 5: Measurement Phase
Execute the measurement iterations, collecting detailed timing data for each run. A streaming callback captures per-token timestamps to compute fine-grained latency metrics. GPU utilization and memory metrics are optionally collected in a background monitoring thread.
Key considerations:
- Each iteration measures end-to-end latency via wall-clock timing
- BenchmarkStreamer captures individual token generation timestamps
- Time-to-first-token (TTFT) measures initial latency
- Inter-token latency (ITL) measures the average time between consecutive tokens
- GPU metrics (utilization, memory) are sampled by a background monitor
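The relationship between the per-token timestamps and the latency metrics can be sketched as follows. `TokenTimer` is a hypothetical recorder written for this example and is not the framework's BenchmarkStreamer; the GPU monitoring thread is omitted.

```python
import time


class TokenTimer:
    """Hypothetical recorder: call on_token() once per generated token."""

    def __init__(self):
        self.timestamps = []

    def on_token(self):
        self.timestamps.append(time.perf_counter())


def latency_metrics(start, token_timestamps, end):
    """Derive end-to-end latency, TTFT, and mean ITL from raw timestamps."""
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    return {
        "e2e_latency": end - start,
        "ttft": token_timestamps[0] - start if token_timestamps else None,
        "itl": sum(gaps) / len(gaps) if gaps else 0.0,
    }


def measure_once(generate_fn):
    """Time one run; generate_fn must invoke the callback once per token."""
    timer = TokenTimer()
    start = time.perf_counter()
    generate_fn(timer.on_token)
    end = time.perf_counter()
    return latency_metrics(start, timer.timestamps, end)


# Worked example with hand-picked timestamps (seconds)
example = latency_metrics(start=0.0, token_timestamps=[0.5, 0.75, 1.0], end=1.0)
print(example)  # {'e2e_latency': 1.0, 'ttft': 0.5, 'itl': 0.25}
```

In the worked example the first token arrives after 0.5 s, and the two remaining gaps of 0.25 s each give a mean ITL of 0.25 s.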
Step 6: Statistical Aggregation
Compute summary statistics across all measurement iterations for each metric. Calculate mean, standard deviation, min, median, max, and 95th percentile for end-to-end latency, TTFT, ITL, and throughput.
Key considerations:
- Statistics are computed per-metric across all measurement iterations
- P95 latency is particularly useful for understanding tail latency
- Throughput is calculated as total tokens generated divided by end-to-end time
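A minimal sketch of the aggregation, using only the Python standard library, is shown here. The choice of sample standard deviation and of linear interpolation for the 95th percentile are assumptions; the framework may use different conventions. `summarize` and `throughput` are hypothetical names.

```python
import statistics


def summarize(values):
    """Summary statistics for one metric; needs at least two measurements."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (assumption)
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
        # 95th percentile with linear interpolation (one of several conventions)
        "p95": statistics.quantiles(ordered, n=100, method="inclusive")[94],
    }


def throughput(total_tokens, e2e_seconds):
    """Tokens per second: total tokens generated divided by end-to-end time."""
    return total_tokens / e2e_seconds


stats = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
tokens_per_second = throughput(128, 2.0)
print(stats["mean"], stats["median"], tokens_per_second)  # 3.0 3.0 64.0
```

With the default of 20 measurement iterations, the P95 value sits between the two slowest runs, so a single outlier can move it noticeably.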
Step 7: Results Serialization and Reporting
Save results to JSON files organized by model name and timestamp. Optionally upload results to a Hugging Face Hub dataset for historical tracking. Display a summary table comparing all configurations.
Key considerations:
- Results include full configuration, raw measurements, and computed statistics
- Hugging Face Hub upload supports both full and summarized result formats
- Results are tagged with git commit ID for regression tracking
- JSON output enables custom analysis and visualization
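Serialization along these lines could be sketched as below. The directory layout, file naming, and payload keys are assumptions made for illustration, and `current_commit` and `save_results` are hypothetical helpers rather than the framework's own functions.

```python
import json
import subprocess
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def current_commit():
    """Return the git commit ID, or None when git or a checkout is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def save_results(output_dir, model_name, config, measurements, stats):
    """Write one JSON file under <output_dir>/<model>/ and return its path."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    safe_name = model_name.replace("/", "--")  # keep the model ID path-safe
    path = Path(output_dir) / safe_name / f"{safe_name}_{timestamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "config": config,
        "measurements": measurements,
        "statistics": stats,
        "commit_id": current_commit(),
        "timestamp": timestamp,
    }
    path.write_text(json.dumps(payload, indent=2))
    return path


# Usage: write to a temporary directory and read the file back
with tempfile.TemporaryDirectory() as tmp:
    saved = save_results(
        tmp, "org/model", {"batch_size": 1}, {"e2e_latency": [1.0, 1.1]}, {"mean": 1.05}
    )
    loaded = json.loads(saved.read_text())
print(sorted(loaded))
```

A file written this way could then be pushed to a Hub dataset, for example with `HfApi.upload_file` from `huggingface_hub` using `repo_type="dataset"`; whether the framework uploads in exactly this way is not stated in the text.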