Principle:EvolvingLMMs Lab Lmms eval Metric Definition
| Knowledge Sources | |
|---|---|
| Domains | Metrics, Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation metrics must be explicitly defined with a scoring function, an aggregation strategy, and a directionality indicator so that benchmark results are computed consistently and comparably.
Description
Metrics are the final output of any evaluation pipeline. They transform per-sample model outputs into summary scores that characterize model performance. A well-designed metric system must address three concerns:
1. Per-sample scoring: A metric function receives individual sample results and returns a numeric value (or a structured value that will be reduced later). Built-in examples include exact_match (binary 0/1 comparison), acc (accuracy), perplexity (exponential of mean negative log-likelihood), and f1 (token-level F1 score).
2. Aggregation: After per-sample scores are computed across all documents, an aggregation function reduces them to a single benchmark-level score. Common aggregations include mean (arithmetic average), median, and bypass (a no-op that returns a sentinel value, useful when aggregation is handled by a custom process_results function). Corpus-level metrics like BLEU and chrF perform aggregation across all samples simultaneously rather than averaging per-sample scores.
3. Directionality: Each metric must declare whether a higher value indicates better performance (higher_is_better: true) or worse performance (higher_is_better: false, as with perplexity). This metadata is used for result reporting and comparison.
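The three concerns can be illustrated with a minimal, self-contained sketch. The helper names below (exact_match_score, mean, perplexity_from_nll, is_improvement) are hypothetical and written for illustration only; they are not lmms-eval's actual implementations.

```python
import math


def exact_match_score(prediction: str, reference: str) -> int:
    """Per-sample scoring: 1 if the strings match exactly, else 0."""
    return int(prediction == reference)


def mean(values):
    """Aggregation: arithmetic average of per-sample scores."""
    return sum(values) / len(values)


def perplexity_from_nll(neg_log_likelihoods):
    """Perplexity as the exponential of the mean negative log-likelihood."""
    return math.exp(mean(neg_log_likelihoods))


def is_improvement(old: float, new: float, higher_is_better: bool) -> bool:
    """Directionality: decide whether `new` beats `old` for this metric."""
    return new > old if higher_is_better else new < old


predictions = ["cat", "dog", "bird"]
references = ["cat", "dog", "fish"]
per_sample = [exact_match_score(p, r) for p, r in zip(predictions, references)]
accuracy = mean(per_sample)  # 2 of 3 samples match
```

Without the directionality flag, a comparison such as is_improvement(12.0, 10.0, ...) would be ambiguous: a drop from 12.0 to 10.0 is a gain for perplexity but a loss for accuracy.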
In lmms-eval, metrics are defined through a registry system. Built-in metrics are registered at module load time using the @register_metric decorator. Custom metrics can be defined in two ways:
- YAML inline: The metric_list in a task YAML references registered metric names by string, with optional aggregation and higher_is_better overrides.
- Custom functions: When process_results is defined, the framework bypasses the standard per-sample metric functions and instead passes the results dict directly to the aggregation functions. The metric names in metric_list become keys that route process_results output to the correct aggregation function.
This two-tier system (registered built-in metrics vs. custom process_results + aggregation) provides flexibility: simple tasks can use standard metrics by name, while complex tasks can implement arbitrary scoring logic.
Usage
Apply this principle whenever you configure the scoring section of a task YAML. For tasks with straightforward scoring (e.g., exact string match), reference built-in metrics by name in metric_list. For tasks requiring complex per-sample logic (e.g., category-based scoring, pairwise evaluation), implement a custom process_results function that returns a dict keyed by metric name, and provide custom aggregation functions via !function references in the aggregation field of each metric entry.
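As a rough sketch of the custom path, the pair of functions below implements category-based scoring. The metric name category_acc, the document fields ("answer", "category"), and the assumption that results is a list whose first item is the model's text are all illustrative choices, not taken from any real lmms-eval task.

```python
from collections import defaultdict


def process_results(doc, results):
    """Score one sample and return a dict keyed by metric name.

    Assumed shapes (illustrative only): `doc` has "answer" and "category"
    fields, and `results` is a list whose first item is the model's text.
    """
    prediction = results[0].strip().lower()
    correct = int(prediction == doc["answer"].strip().lower())
    # The value is structured so the aggregation can group by category.
    return {"category_acc": {"category": doc["category"], "correct": correct}}


def aggregate_category_acc(items):
    """Macro-average: the mean of the per-category accuracies."""
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item["correct"])
    per_category = [sum(v) / len(v) for v in by_category.values()]
    return sum(per_category) / len(per_category)


docs = [
    {"answer": "yes", "category": "color"},
    {"answer": "no", "category": "color"},
    {"answer": "two", "category": "count"},
]
outputs = [["Yes"], ["yes"], ["two"]]
items = [process_results(d, o)["category_acc"] for d, o in zip(docs, outputs)]
score = aggregate_category_acc(items)  # color: 0.5, count: 1.0
```

In a task YAML, the metric_list entry named category_acc would then point its aggregation field at a function like aggregate_category_acc through a !function reference.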
Theoretical Basis
The metric computation pipeline follows a two-phase map-reduce pattern:
Phase 1 (Map): For each document d_i in the evaluation set:
    result_i = process_results(d_i, model_output_i)
    # result_i = {metric_name: value_i, ...}
Phase 2 (Reduce): For each metric m:
    values_m = [result_i[m] for all i]
    score_m = aggregation_m(values_m)
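The two phases above can be written as a small runnable driver. This is a sketch under stated assumptions: evaluate and toy_process_results are hypothetical names, and the real framework's evaluation loop handles many details (batching, filtering, multiple model requests per document) that are omitted here.

```python
from statistics import mean


def evaluate(docs, model_outputs, process_results, aggregations):
    """Run both phases for every metric named in `aggregations`."""
    # Phase 1 (map): one result dict per document.
    results = [process_results(d, o) for d, o in zip(docs, model_outputs)]
    # Phase 2 (reduce): collect each metric's values, then aggregate them.
    scores = {}
    for name, aggregate in aggregations.items():
        values = [r[name] for r in results]
        scores[name] = aggregate(values)
    return scores


def toy_process_results(doc, output):
    # Hypothetical scoring: exact match plus the output length in characters.
    return {"exact_match": float(output == doc["target"]), "length": len(output)}


docs = [{"target": "4"}, {"target": "Paris"}]
outputs = ["4", "London"]
scores = evaluate(
    docs,
    outputs,
    toy_process_results,
    {"exact_match": mean, "length": max},
)
```

Because each metric carries its own aggregation function, the same map phase can feed a mean for one metric and a maximum for another.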
When process_results is not defined, the framework uses the standard metric functions:
Phase 1 (Map): For each document d_i:
    For each metric m in metric_list:
        value_i_m = metric_fn_m(model_output_i, ground_truth_i)
Phase 2 (Reduce): For each metric m:
    score_m = aggregation_m([value_i_m for all i])
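A minimal sketch of this fallback path follows. The function evaluate_standard and the metric exact_match_ci are hypothetical; the loop iterates metric-first rather than document-first, which produces the same values as the pseudocode above.

```python
def mean(values):
    return sum(values) / len(values)


def evaluate_standard(outputs, ground_truths, metric_fns, aggregations):
    """Score every sample with each metric function, then aggregate."""
    scores = {}
    for name, metric_fn in metric_fns.items():
        # Phase 1 (map): one value per (output, ground truth) pair.
        values = [metric_fn(o, g) for o, g in zip(outputs, ground_truths)]
        # Phase 2 (reduce): collapse the values into one benchmark score.
        scores[name] = aggregations[name](values)
    return scores


metric_fns = {
    "exact_match": lambda out, gt: int(out == gt),
    # Hypothetical case-insensitive variant, for illustration only.
    "exact_match_ci": lambda out, gt: int(out.lower() == gt.lower()),
}
aggregations = {"exact_match": mean, "exact_match_ci": mean}

outputs = ["Paris", "berlin", "Rome"]
ground_truths = ["Paris", "Berlin", "Madrid"]
scores = evaluate_standard(outputs, ground_truths, metric_fns, aggregations)
```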
The registry maintains three parallel mappings:
METRIC_REGISTRY: metric_name -> scoring_function
METRIC_AGGREGATION_REGISTRY: metric_name -> aggregation_function
HIGHER_IS_BETTER_REGISTRY: metric_name -> bool
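One way to keep the three mappings in sync is to fill them from a single decorator call. The sketch below is a toy registry, not lmms-eval's code: the dictionary names mirror the mappings listed above, but the decorator's signature (name, aggregation, higher_is_better) is an assumption, and the real @register_metric may accept different arguments.

```python
METRIC_REGISTRY = {}
METRIC_AGGREGATION_REGISTRY = {}
HIGHER_IS_BETTER_REGISTRY = {}


def register_metric(name, aggregation, higher_is_better):
    """Decorator that fills all three mappings under one metric name."""
    def decorate(fn):
        METRIC_REGISTRY[name] = fn
        METRIC_AGGREGATION_REGISTRY[name] = aggregation
        HIGHER_IS_BETTER_REGISTRY[name] = higher_is_better
        return fn
    return decorate


def mean(values):
    return sum(values) / len(values)


# Registration happens when this module is imported (module load time).
@register_metric("exact_match", aggregation=mean, higher_is_better=True)
def exact_match(prediction, reference):
    return int(prediction == reference)


# A task config can now refer to the metric purely by its string name.
score_fn = METRIC_REGISTRY["exact_match"]
aggregate = METRIC_AGGREGATION_REGISTRY["exact_match"]
benchmark_score = aggregate([score_fn("a", "a"), score_fn("b", "c")])
```

Registering all three entries in one place means a metric name can never resolve to a scoring function without also resolving to an aggregation and a direction.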
Default metric assignments per output type:
- loglikelihood: ["perplexity", "acc"]
- multiple_choice: ["acc", "acc_norm"]
- generate_until: ["exact_match"]