Overview
A concrete tool from the Hugging Face Optimum library for orchestrating model benchmarking runs. Each run compares a Transformers baseline against an optimized model on latency, throughput, and evaluation metrics.
Description
This module provides three key classes:
- Calibrator — Abstract base class for post-training calibration with static quantization. Subclasses implement the `fit()` method for specific backends.
- Run — Orchestrates a benchmarking comparison between a transformers model and an optimized model. It uses Optuna grid search over batch sizes and input lengths to measure latency/throughput, collects evaluation metrics, and produces a structured results dictionary.
- TimeBenchmark — Measures model inference latency and throughput by running the model forward pass for a configured duration, computing statistics (mean, std, percentiles) over collected latencies.
The module also defines the `task_processing_map` dictionary that maps task names to their corresponding TaskProcessor classes.
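The grid over batch sizes and input lengths described for Run can be pictured with a small standalone sketch. It does not use Optuna or Optimum: `itertools.product` stands in for the grid search, and `run_grid` and `fake_measure` are hypothetical names introduced only for illustration.

```python
from itertools import product


def run_grid(batch_sizes, input_lengths, measure):
    """Call `measure(batch_size, input_length)` once per grid point."""
    results = []
    for batch_size, input_length in product(batch_sizes, input_lengths):
        entry = {"batch_size": batch_size, "input_length": input_length}
        entry.update(measure(batch_size, input_length))
        results.append(entry)
    return results


def fake_measure(batch_size, input_length):
    # Hypothetical stand-in for a real latency/throughput measurement.
    return {"tokens": batch_size * input_length}


grid_results = run_grid([1, 8], [64, 128], fake_measure)
for row in grid_results:
    print(row)
```

Each grid point produces one result entry, which is roughly how a per-configuration timing list would be built up before being merged into the final results dictionary.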
Usage
Subclass Run to implement backend-specific benchmarking (e.g., ONNX Runtime). The TimeBenchmark class is used internally by Run subclasses to measure inference performance. Calibrator is subclassed for backend-specific calibration logic.
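The subclassing pattern can be sketched without installing Optimum. The `Calibrator` class below is a stand-in that only mirrors the documented constructor and `fit()` contract; `MinMaxCalibrator` and its min/max logic are hypothetical and are not taken from any real backend.

```python
class Calibrator:
    """Stand-in mirroring the documented base class; not imported from Optimum."""

    def __init__(self, calibration_dataset, quantizer, model_path, qconfig,
                 calibration_params, node_exclusion):
        self.calibration_dataset = calibration_dataset
        self.quantizer = quantizer
        self.model_path = model_path
        self.qconfig = qconfig
        self.calibration_params = calibration_params
        self.node_exclusion = node_exclusion

    def fit(self):
        # Subclasses are expected to override this.
        raise NotImplementedError()


class MinMaxCalibrator(Calibrator):
    """Hypothetical backend: records the value range seen in the calibration data."""

    def fit(self):
        values = [v for sample in self.calibration_dataset for v in sample]
        return min(values), max(values)


calibrator = MinMaxCalibrator(
    calibration_dataset=[[0.5, -1.0], [2.0, 0.25]],  # toy data, not a real Dataset
    quantizer=None,
    model_path="model.onnx",  # placeholder path
    qconfig=None,
    calibration_params={},
    node_exclusion=[],
)
value_range = calibrator.fit()
print(value_range)
```

A real subclass would hand the dataset to the backend's own calibration routine; the point here is only that the base class fixes the constructor and leaves `fit()` to the backend.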
Code Reference
Source Location
Signature
class Calibrator:
    def __init__(
        self,
        calibration_dataset: "Dataset",
        quantizer,
        model_path,
        qconfig,
        calibration_params,
        node_exclusion,
    ):
        """Initialize calibrator for static quantization."""

    def fit(self):
        """Run calibration. Must be implemented by subclasses."""
        raise NotImplementedError()


class Run:
    def __init__(self, run_config: dict):
        """Initialize a benchmarking run.

        Args:
            run_config (dict): Parameters validated against RunConfig.
        """

    def launch(self) -> dict:
        """Launch the full benchmark: time + evaluation."""

    def _launch_time(self, trial):
        """Optuna objective for latency/throughput measurement."""

    def launch_eval(self):
        """Run evaluation on original and optimized models."""

    def load_datasets(self):
        """Load evaluation (and calibration) datasets."""

    def get_calibration_dataset(self) -> "Dataset":
        """Return the calibration dataset."""

    def get_eval_dataset(self) -> "Dataset":
        """Return the evaluation dataset."""

    def finalize(self):
        """Clean up intermediary files."""


class TimeBenchmark:
    def __init__(
        self,
        model,
        batch_size: int,
        input_length: int,
        model_input_names: Set[str],
        warmup_runs: int,
        duration: float,
    ):
        """Initialize time benchmark.

        Args:
            model: The model to benchmark.
            batch_size: Batch size for inputs.
            input_length: Sequence length for inputs.
            model_input_names: Set of expected input names.
            warmup_runs: Number of warmup iterations.
            duration: Benchmark duration in seconds.
        """

    def track(self):
        """Context manager to track a single forward pass latency."""

    def execute(self) -> dict:
        """Run the full benchmark and return statistics."""

    def to_dict(self) -> dict:
        """Return latency/throughput statistics as a dictionary."""
Import
from optimum.runs_base import Run, Calibrator, TimeBenchmark, task_processing_map
I/O Contract
Inputs (Run)
| Name | Type | Required | Description |
|---|---|---|---|
| run_config | dict | Yes | Configuration dictionary validated against RunConfig schema |
Outputs (Run.launch)
| Name | Type | Description |
|---|---|---|
| return_body | dict | Complete benchmark results with evaluation.time and evaluation.others |
Inputs (TimeBenchmark)
| Name | Type | Required | Description |
|---|---|---|---|
| model | nn.Module | Yes | The model to benchmark |
| batch_size | int | Yes | Batch size for generated dummy inputs |
| input_length | int | Yes | Sequence length for generated dummy inputs |
| model_input_names | Set[str] | Yes | Expected model input names (e.g., input_ids, attention_mask) |
| warmup_runs | int | Yes | Number of warmup forward passes |
| duration | float | Yes | Benchmark duration in seconds |
Outputs (TimeBenchmark.execute)
| Name | Type | Description |
|---|---|---|
| stats | dict | Dictionary with nb_forwards, throughput, latency_mean, latency_std, and percentiles (50, 90, 95, 99, 999) |
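To make the shape of these statistics concrete, here is a minimal standalone sketch of a duration-bounded timing loop. It is not the Optimum implementation: `SketchTimeBenchmark` is a hypothetical class, the `latency_<p>` key names are assumed, reading "999" as the 99.9th percentile is also an assumption, and the nearest-rank percentile is a simplification.

```python
import statistics
import time
from contextlib import contextmanager


class SketchTimeBenchmark:
    """Hypothetical, simplified stand-in for TimeBenchmark."""

    def __init__(self, forward, warmup_runs, duration):
        self.forward = forward        # zero-argument callable standing in for a forward pass
        self.warmup_runs = warmup_runs
        self.duration = duration      # benchmark budget in seconds
        self.latencies = []           # per-call latencies in nanoseconds

    @contextmanager
    def track(self):
        # Time whatever runs inside the `with` block and record it.
        start = time.perf_counter_ns()
        yield
        self.latencies.append(time.perf_counter_ns() - start)

    def execute(self):
        for _ in range(self.warmup_runs):  # warmup calls are not recorded
            self.forward()
        budget_ns = self.duration * 1e9
        elapsed_ns = 0
        while elapsed_ns < budget_ns:
            with self.track():
                self.forward()
            elapsed_ns += self.latencies[-1]
        return self.to_dict()

    def to_dict(self):
        ordered = sorted(self.latencies)
        if not ordered:
            return {"nb_forwards": 0}

        def pct(q):
            # Nearest-rank percentile, converted from ns to ms.
            index = min(len(ordered) - 1, int(q * len(ordered)))
            return ordered[index] / 1e6

        total_s = sum(ordered) / 1e9
        return {
            "nb_forwards": len(ordered),
            "throughput": len(ordered) / total_s,  # forward passes per second
            "latency_mean": statistics.fmean(ordered) / 1e6,
            "latency_std": statistics.pstdev(ordered) / 1e6,
            "latency_50": pct(0.50),
            "latency_90": pct(0.90),
            "latency_95": pct(0.95),
            "latency_99": pct(0.99),
            "latency_999": pct(0.999),  # assumed to mean the 99.9th percentile
        }


# A 1 ms sleep stands in for a model forward pass.
bench = SketchTimeBenchmark(lambda: time.sleep(0.001), warmup_runs=2, duration=0.05)
stats = bench.execute()
print(stats["nb_forwards"], round(stats["latency_mean"], 3))
```

The loop stops once the accumulated tracked time reaches the budget, so warmup calls and loop overhead do not count toward the reported statistics. A forward callable that takes no measurable time would never exhaust the budget, which is why the example uses a short sleep.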
Usage Examples
Using TimeBenchmark
from optimum.runs_base import TimeBenchmark
# Assuming `model` is a loaded PyTorch model
benchmark = TimeBenchmark(
    model=model,
    batch_size=8,
    input_length=128,
    model_input_names={"input_ids", "attention_mask"},
    warmup_runs=10,
    duration=30.0,
)
stats = benchmark.execute()
print(f"Throughput: {stats['throughput']} samples/s")
print(f"Mean latency: {stats['latency_mean']:.2f} ms")