

Principle:EvolvingLMMs Lab Lmms eval Metric Definition

From Leeroopedia
Knowledge Sources
Domains Metrics, Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

Evaluation metrics must be explicitly defined with a scoring function, an aggregation strategy, and a directionality indicator so that benchmark results are computed consistently and comparably.

Description

Metrics are the final output of any evaluation pipeline. They transform per-sample model outputs into summary scores that characterize model performance. A well-designed metric system must support three concerns:

1. Per-sample scoring: A metric function receives individual sample results and returns a numeric value (or a structured value that will be reduced later). Built-in examples include exact_match (binary 0/1 comparison), acc (accuracy), perplexity (exponential of mean negative log-likelihood), and f1 (token-level F1 score).

2. Aggregation: After per-sample scores are computed across all documents, an aggregation function reduces them to a single benchmark-level score. Common aggregations include mean (arithmetic average), median, and bypass (a no-op that returns a sentinel value, useful when aggregation is handled by a custom process_results function). Corpus-level metrics like BLEU and chrF perform aggregation across all samples simultaneously rather than averaging per-sample scores.

3. Directionality: Each metric must declare whether a higher value indicates better performance (higher_is_better: true) or worse performance (higher_is_better: false, as with perplexity). This metadata is used for result reporting and comparison.
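
The three concerns above can be illustrated with a small Python sketch. `MetricSpec`, `exact_match_score`, and `perplexity` are hypothetical names chosen for this example and are not lmms-eval classes or functions; the sketch only shows how a scoring function, an aggregation, and a directionality flag fit together.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical container for the three concerns (not an lmms-eval class)."""
    name: str
    score: Callable[[str, str], float]             # per-sample scoring
    aggregate: Callable[[Sequence[float]], float]  # benchmark-level reduction
    higher_is_better: bool                         # directionality


def exact_match_score(prediction: str, reference: str) -> float:
    # Binary 0/1 comparison after trimming surrounding whitespace.
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def perplexity(log_likelihoods: Sequence[float]) -> float:
    # Exponential of the mean negative log-likelihood; lower is better.
    return math.exp(-mean(log_likelihoods))


exact_match = MetricSpec("exact_match", exact_match_score, mean, True)

predictions = ["cat", "dog ", "bird"]
references = ["cat", "dog", "fish"]

# Score every sample, then reduce to one benchmark-level number.
per_sample = [exact_match.score(p, r) for p, r in zip(predictions, references)]
benchmark_score = exact_match.aggregate(per_sample)  # two of three match
```

Keeping the directionality flag next to the scoring and aggregation functions means a reporting layer can decide whether an increase is an improvement without knowing anything else about the metric.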

In lmms-eval, metrics are defined through a registry system. Built-in metrics are registered at module load time using the @register_metric decorator. A task specifies its metrics in one of two ways:

  • YAML inline: The metric_list in a task YAML references registered metric names by string, with optional aggregation and higher_is_better overrides.
  • Custom functions: When process_results is defined, the framework bypasses the standard per-sample metric functions. It collects the values that process_results returns under each metric name and passes them to that metric's aggregation function. The metric names in metric_list therefore act as keys that route process_results output to the correct aggregation function.

This two-tier system (registered built-in metrics vs. custom process_results + aggregation) provides flexibility: simple tasks can use standard metrics by name, while complex tasks can implement arbitrary scoring logic.

Usage

Apply this principle whenever you configure the scoring section of a task YAML. For tasks with straightforward scoring (e.g., exact string match), reference built-in metrics by name in metric_list. For tasks that need complex per-sample logic (e.g., category-based scoring, pairwise evaluation), implement a custom process_results function that returns a dict keyed by metric name, and supply a custom aggregation function for each metric entry through a !function reference in its aggregation field.
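
A custom scoring pair for category-based accuracy might look like the sketch below. The function names, the metric key `category_acc`, and the document fields `answer` and `category` are assumptions made for illustration; the `(doc, results)` argument order is also assumed here rather than taken from this page.

```python
from collections import defaultdict
from typing import Any, Dict, List


def demo_process_results(doc: Dict[str, Any], results: List[str]) -> Dict[str, Any]:
    """Score one document. The returned key must match a name in metric_list."""
    prediction = results[0].strip().lower()
    correct = prediction == doc["answer"].strip().lower()
    # Return a structured value; the aggregation function reduces it later.
    return {"category_acc": {"category": doc["category"], "correct": correct}}


def demo_category_aggregation(items: List[Dict[str, Any]]) -> float:
    """Macro-average: accuracy within each category, then the mean over categories."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item["category"]].append(1.0 if item["correct"] else 0.0)
    per_category = [sum(values) / len(values) for values in buckets.values()]
    return sum(per_category) / len(per_category)


# Hypothetical documents and model outputs.
docs = [
    {"answer": "A", "category": "ocr"},
    {"answer": "B", "category": "ocr"},
    {"answer": "C", "category": "math"},
]
outputs = [["a"], ["c"], ["C"]]

items = [demo_process_results(d, o)["category_acc"] for d, o in zip(docs, outputs)]
score = demo_category_aggregation(items)  # ocr: 0.5, math: 1.0 -> 0.75
```

Returning a structured per-sample value and deferring the reduction keeps category bookkeeping out of the per-sample step, which is why the aggregation has to be custom as well.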

Theoretical Basis

The metric computation pipeline follows a two-phase map-reduce pattern:

Phase 1 (Map): For each document d_i in the evaluation set:
    result_i = process_results(d_i, model_output_i)
    # result_i = {metric_name: value_i, ...}

Phase 2 (Reduce): For each metric m:
    values_m = [result_i[m] for all i]
    score_m = aggregation_m(values_m)
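
The two phases above can be written as a short runnable driver. This is a minimal sketch under stated assumptions: `run_custom_pipeline` and `toy_process_results` are hypothetical helpers, not framework functions, and the toy metrics exist only to show the data flow.

```python
from statistics import mean
from typing import Any, Callable, Dict, List


def run_custom_pipeline(
    docs: List[Dict[str, Any]],
    model_outputs: List[str],
    process_results: Callable[[Dict[str, Any], str], Dict[str, Any]],
    aggregations: Dict[str, Callable[[List[Any]], float]],
) -> Dict[str, float]:
    # Phase 1 (map): one result dict per document.
    per_doc = [process_results(doc, out) for doc, out in zip(docs, model_outputs)]

    # Phase 2 (reduce): gather each metric's values and aggregate them.
    scores = {}
    for name, aggregate in aggregations.items():
        values = [result[name] for result in per_doc]
        scores[name] = aggregate(values)
    return scores


def toy_process_results(doc: Dict[str, Any], output: str) -> Dict[str, Any]:
    # Two toy metrics: correctness and output length in characters.
    return {
        "acc": 1.0 if output == doc["answer"] else 0.0,
        "max_length": len(output),
    }


docs = [{"answer": "4"}, {"answer": "7"}]
model_outputs = ["4", "seven"]
scores = run_custom_pipeline(
    docs, model_outputs, toy_process_results, {"acc": mean, "max_length": max}
)
# scores == {"acc": 0.5, "max_length": 5}
```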

When process_results is not defined, the framework uses the standard metric functions:

Phase 1 (Map): For each document d_i:
    For each metric m in metric_list:
        value_i_m = metric_fn_m(model_output_i, ground_truth_i)

Phase 2 (Reduce): For each metric m:
    score_m = aggregation_m([value_i_m for all i])
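
The standard path can be sketched the same way. `run_standard_pipeline` and the two toy scoring functions below are hypothetical and stand in for whatever metric functions a task references by name.

```python
from statistics import mean
from typing import Callable, Dict, List


def exact_match(output: str, truth: str) -> float:
    return 1.0 if output == truth else 0.0


def exact_match_ignore_case(output: str, truth: str) -> float:
    return 1.0 if output.lower() == truth.lower() else 0.0


def run_standard_pipeline(
    model_outputs: List[str],
    ground_truths: List[str],
    metric_fns: Dict[str, Callable[[str, str], float]],
    aggregations: Dict[str, Callable[[List[float]], float]],
) -> Dict[str, float]:
    # Phase 1 (map): apply every metric function to every sample.
    values: Dict[str, List[float]] = {name: [] for name in metric_fns}
    for output, truth in zip(model_outputs, ground_truths):
        for name, fn in metric_fns.items():
            values[name].append(fn(output, truth))

    # Phase 2 (reduce): aggregate each metric's per-sample values.
    return {name: aggregations[name](vals) for name, vals in values.items()}


scores = run_standard_pipeline(
    ["Paris", "rome", "Berlin"],
    ["Paris", "Rome", "Madrid"],
    {"exact_match": exact_match, "exact_match_ci": exact_match_ignore_case},
    {"exact_match": mean, "exact_match_ci": mean},
)
# exact_match: 1 of 3 correct; exact_match_ci: 2 of 3 correct
```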

The registry maintains three parallel mappings:

METRIC_REGISTRY:             metric_name -> scoring_function
METRIC_AGGREGATION_REGISTRY: metric_name -> aggregation_function
HIGHER_IS_BETTER_REGISTRY:   metric_name -> bool
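
A toy version of the three parallel mappings shows how one decorator call can populate all of them. The decorator below is named `register_metric_sketch` to make clear that it is an illustration; its signature is an assumption and may differ from the real @register_metric.

```python
from statistics import mean
from typing import Callable, Dict, List, Tuple

# Toy registries mirroring the three mappings described above.
METRIC_REGISTRY: Dict[str, Callable] = {}
METRIC_AGGREGATION_REGISTRY: Dict[str, Callable] = {}
HIGHER_IS_BETTER_REGISTRY: Dict[str, bool] = {}


def register_metric_sketch(name: str, aggregation: Callable, higher_is_better: bool):
    """Hypothetical decorator: fills all three registries under one name."""
    def decorate(fn: Callable) -> Callable:
        if name in METRIC_REGISTRY:
            raise ValueError(f"metric '{name}' is already registered")
        METRIC_REGISTRY[name] = fn
        METRIC_AGGREGATION_REGISTRY[name] = aggregation
        HIGHER_IS_BETTER_REGISTRY[name] = higher_is_better
        return fn
    return decorate


@register_metric_sketch("exact_match", aggregation=mean, higher_is_better=True)
def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction == reference else 0.0


def evaluate(name: str, pairs: List[Tuple[str, str]]) -> float:
    # Look up scoring and aggregation by metric name, as a task config would.
    score_fn = METRIC_REGISTRY[name]
    aggregate = METRIC_AGGREGATION_REGISTRY[name]
    return aggregate([score_fn(p, r) for p, r in pairs])


result = evaluate("exact_match", [("a", "a"), ("b", "c")])  # 0.5
```

Because all three mappings share the metric name as their key, a task configuration only needs that string to recover the scoring function, its default aggregation, and its directionality.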

Default metric assignments per output type:

  • loglikelihood: ["perplexity", "acc"]
  • multiple_choice: ["acc", "acc_norm"]
  • generate_until: ["exact_match"]
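
One way to read these defaults is as a fallback lookup. The sketch below assumes that the defaults apply only when a task supplies no metric_list, and that each metric_list entry is a mapping with a "metric" key; both are assumptions for illustration, and `resolve_metric_names` is a hypothetical helper.

```python
from typing import Dict, List, Optional

# Defaults copied from the list above.
DEFAULT_METRICS: Dict[str, List[str]] = {
    "loglikelihood": ["perplexity", "acc"],
    "multiple_choice": ["acc", "acc_norm"],
    "generate_until": ["exact_match"],
}


def resolve_metric_names(
    output_type: str, metric_list: Optional[List[Dict[str, object]]] = None
) -> List[str]:
    """Hypothetical helper: explicit metric_list wins, else use the defaults."""
    if metric_list:
        return [str(entry["metric"]) for entry in metric_list]
    # Copy the list so callers cannot mutate the shared defaults.
    return list(DEFAULT_METRICS[output_type])


default_names = resolve_metric_names("multiple_choice")
explicit_names = resolve_metric_names(
    "generate_until", [{"metric": "f1", "higher_is_better": True}]
)
```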
