
Implementation:EvolvingLMMs Lab Lmms eval Process Results

From Leeroopedia
Knowledge Sources
Domains Evaluation, Metrics
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tool for processing model outputs through filters and computing evaluation metrics with aggregation, provided by the lmms-eval framework.

Description

The ConfigurableTask.process_results() method is the per-document scoring function: it takes a single evaluation document and the model's output, and returns a dictionary mapping metric names to scores. It is decorated with a tenacity @retry policy (stop after 5 attempts or 1200 seconds total, with a 2-second wait between tries) to handle transient failures from external metric services such as LLM-as-judge APIs.
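The retry semantics can be pictured with a minimal stdlib stand-in (a sketch only; the real decorator comes from the tenacity library, and the 1200-second time-based stop is omitted here for brevity):

```python
import time

def retry(fn, attempts=5, wait=2.0):
    """Minimal stand-in for the tenacity policy on process_results:
    retry on any exception, up to `attempts` tries, sleeping `wait`
    seconds between them. (The real policy also stops after a total
    elapsed delay, not modeled here.)"""
    def wrapped(*args, **kwargs):
        for i in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if i == attempts - 1:
                    raise          # attempts exhausted: re-raise the last error
                time.sleep(wait)
    return wrapped

# Demo: a judge call that fails twice before succeeding.
calls = []
def flaky_judge_call():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("transient API error")
    return {"llm_as_judge": 1.0}

score = retry(flaky_judge_call, attempts=5, wait=0.0)()
```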

The method supports three main output types:

  • generate_until -- Strips whitespace from generated text, then applies each metric function from the task's _metric_fn_list. For tasks with multiple reference answers (multiple_target), it computes the max score across all references.
  • multiple_choice -- Computes accuracy by taking the argmax of the log-likelihoods across choices (highest log probability, i.e. lowest loss). Supports accuracy, length-normalized accuracy, exact match, mutual-information scoring, F1, and MCC.
  • loglikelihood -- Returns accuracy (whether the model's greedy completion matches the target) and a perplexity entry (the raw log-likelihood, converted to perplexity by the task-level aggregation).
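The multiple_choice path above can be sketched as follows (an illustrative reimplementation, not the actual lmms-eval code; the length normalization assumes per-choice completion lengths are available):

```python
def score_multiple_choice(results, gold_index, completion_lengths):
    """Sketch of multiple_choice scoring: `results` holds one
    (logprob, is_greedy) tuple per answer choice; the prediction is
    the choice with the highest log-likelihood."""
    lls = [ll for ll, _ in results]
    pred = max(range(len(lls)), key=lambda i: lls[i])
    # acc_norm: normalize each log-likelihood by its completion length,
    # so longer choices are not penalized for accumulating more loss
    lls_norm = [ll / length for ll, length in zip(lls, completion_lengths)]
    pred_norm = max(range(len(lls_norm)), key=lambda i: lls_norm[i])
    return {
        "acc": 1.0 if pred == gold_index else 0.0,
        "acc_norm": 1.0 if pred_norm == gold_index else 0.0,
    }

scores = score_multiple_choice(
    results=[(-4.1, False), (-1.2, True), (-6.3, False)],
    gold_index=1,
    completion_lengths=[4, 6, 5],
)
# Choice 1 has the highest log-likelihood both raw and normalized.
```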

If the task's YAML configuration defines a custom process_results callable, that function is invoked instead of the built-in logic.
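A custom scorer is just a callable with the same contract; a minimal sketch (the YAML wiring shown in the comment follows the harness convention, but treat it as an assumption for your task's config):

```python
# Sketch of a custom per-document scorer that a task YAML could point to,
# conventionally via:  process_results: !function utils.process_results
def process_results(doc, results):
    """doc: the original evaluation document; results: the model
    output(s). Must return a dict mapping metric names to scores."""
    pred = results[0].strip().lower()
    gold = doc["answer"].strip().lower()
    return {"exact_match": 1.0 if pred == gold else 0.0}

scores = process_results({"answer": "Blue"}, ["blue\n"])
```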

The apply_filters() method runs filter ensembles on instances before scoring. The metrics.py module provides the full metric function and aggregation registry, including bootstrap standard error computation.
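The effect of a filter step can be illustrated with a simple regex extractor (a sketch of the idea only, not lmms-eval's Filter API): free-form generations are reduced to a canonical answer before any metric sees them.

```python
import re

def multiple_choice_letter_filter(responses):
    """Illustrative extraction filter: pull the answer letter out of
    free-form generations; fall back to the stripped response when
    no letter is found."""
    extracted = []
    for resp in responses:
        m = re.search(r"answer is\s*\(?([A-E])\)?", resp)
        extracted.append(m.group(1) if m else resp.strip())
    return extracted

filtered = multiple_choice_letter_filter(
    ["The answer is (B) because the sky scatters blue light."]
)
```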

Usage

Use process_results() when:

  • The evaluator is computing per-document scores after model inference.
  • You are defining a custom task and need to understand the scoring contract.
  • You are debugging metric values by inspecting individual document scores.
  • You need to add a new metric to the registry.

Code Reference

Source Location

  • Repository: lmms-eval
  • File: lmms_eval/api/task.py (L1497-1649), lmms_eval/api/metrics.py (L1-951)

Signature

# Per-document scoring
class ConfigurableTask(Task):
    @retry(
        stop=(stop_after_attempt(5) | stop_after_delay(1200)),
        wait=wait_fixed(2),
    )
    def process_results(
        self,
        doc: dict,
        results: list,
        full_docs: Optional[dict] = None,
    ) -> dict: ...

# Filter application
class Task:
    def apply_filters(self) -> Optional[List[Instance]]: ...

# Metric aggregation with bootstrap stderr
def stderr_for_metric(
    metric: Callable,
    bootstrap_iters: int,
) -> Optional[Callable]: ...

def bootstrap_stderr(
    f: Callable,
    xs: list,
    iters: int,
) -> float: ...

Import

from lmms_eval.api.task import ConfigurableTask
from lmms_eval.api.metrics import bootstrap_stderr, stderr_for_metric
from lmms_eval.api.registry import register_metric, register_aggregation

I/O Contract

Inputs

  • doc (dict, required) -- The original evaluation document containing question, answer, and metadata fields.
  • results (list, required) -- Model output(s): a list of strings for generation, or a list of (logprob, is_greedy) tuples for loglikelihood.
  • full_docs (Optional[dict], optional) -- Full dataset passed for tasks that need cross-document information during scoring.
  • bootstrap_iters (int, optional) -- Number of bootstrap iterations for stderr estimation (default: 100000). Consumed during aggregation, not by process_results itself.

Outputs

  • result_dict (dict) -- Per-document dictionary mapping metric names to scores, e.g. {"exact_match": 1.0, "acc": 0.0}.
  • aggregated_metrics (dict) -- After aggregation: task-level metric values with stderr, e.g. {"acc": 0.75, "acc_stderr": 0.04}.
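Conceptually, aggregation collapses the per-document dictionaries into the task-level dictionary. A minimal sketch using the analytic standard error of the mean (the framework uses bootstrap resampling instead):

```python
import math

def aggregate(per_doc_scores):
    """Sketch: collapse per-document score dicts into task-level
    means plus an analytic stderr per metric."""
    agg = {}
    for key in per_doc_scores[0]:
        xs = [d[key] for d in per_doc_scores]
        n = len(xs)
        mean = sum(xs) / n
        var = sum((x - mean) ** 2 for x in xs) / (n - 1)  # sample variance
        agg[key] = mean
        agg[f"{key}_stderr"] = math.sqrt(var / n)
    return agg

agg = aggregate([{"acc": 1.0}, {"acc": 0.0}, {"acc": 1.0}, {"acc": 1.0}])
# {"acc": 0.75, "acc_stderr": 0.25}
```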

Usage Examples

Basic Example

# process_results is called by the evaluator for each document
doc = {"question": "What color is the sky?", "answer": "blue"}
results = ["blue"]

# For a generate_until task
scores = task.process_results(doc, results)
print(scores)
# {"exact_match": 1.0}

Registering a Custom Metric

from lmms_eval.api.registry import register_metric, register_aggregation

@register_aggregation("mean")
def mean(arr):
    return sum(arr) / len(arr)

@register_metric(
    metric="my_custom_metric",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="mean",
)
def my_custom_metric_fn(references, predictions, **kwargs):
    # Custom scoring logic
    ref = references[0].lower().strip()
    pred = predictions[0].lower().strip()
    return {"my_custom_metric": 1.0 if ref in pred else 0.0}

Bootstrap Standard Error

from lmms_eval.api.metrics import bootstrap_stderr, mean

scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
stderr = bootstrap_stderr(mean, scores, iters=10000)
print(f"Mean: {mean(scores):.4f} +/- {stderr:.4f}")
# e.g. Mean: 0.6250 +/- 0.1768 (the stderr varies slightly run to run,
# since bootstrap resampling is stochastic)

Related Pages

Implements Principle
