Implementation: EvolvingLMMs-Lab lmms-eval Process Results
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for processing model outputs through filters and computing evaluation metrics with aggregation, provided by the lmms-eval framework.
Description
The ConfigurableTask.process_results() method is the per-document scoring function: it takes a single evaluation document and the model's output, and returns a dictionary mapping metric names to scores. It is decorated with a retry mechanism (@retry, stopping after 5 attempts or a 1200-second delay, with a 2-second fixed wait between attempts) to handle transient failures from external metric services such as LLM-as-judge APIs.
The method supports three main output types (a simplified scoring sketch follows this list):
- generate_until -- Strips whitespace from generated text, then applies each metric function from the task's _metric_fn_list. For tasks with multiple reference answers (multiple_target), it computes the max score across all references.
- multiple_choice -- Computes accuracy by selecting the choice with the highest log-likelihood (equivalently, the lowest loss). Supports accuracy, length-normalized accuracy (normalized by completion length), exact match, mutual-information scoring, F1, and MCC.
- loglikelihood -- Returns accuracy (whether the target is the model's greedy completion) and perplexity (derived from the returned log probability).
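For orientation, here is a minimal sketch of the scoring behavior described above. It is not the library's implementation; the metric-function call convention and the document field names ("answer", "gold") are assumptions for illustration.
# Simplified sketch of per-output-type scoring (not the library's code).
# Assumes each metric function returns a scalar score.
import numpy as np
def score_generate_until(doc, result, metric_fns, multiple_target=False):
    pred = result.strip()  # whitespace is stripped before scoring
    refs = doc["answer"] if multiple_target else [doc["answer"]]
    # With multiple reference answers, keep the best score across references.
    return {name: max(fn(references=[r], predictions=[pred]) for r in refs)
            for name, fn in metric_fns.items()}
def score_multiple_choice(doc, results):
    # results: one (logprob, is_greedy) tuple per answer choice.
    lls = np.array([ll for ll, _ in results])
    pred = int(np.argmax(lls))  # highest log-likelihood = most likely choice
    return {"acc": 1.0 if pred == doc["gold"] else 0.0}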
If the task's YAML configuration defines a custom process_results callable, that function is invoked instead of the built-in logic.
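As an illustration, a custom scorer can be a plain function in the task's utilities module. The module path, function name, and document field below are hypothetical, and the YAML hook in the comment follows the harness's usual !function convention.
# utils.py (hypothetical) -- wired into the task YAML with something like:
#   process_results: !function utils.my_process_results
def my_process_results(doc, results):
    # "answer" is an assumed document field; adapt to the dataset schema.
    pred = results[0].strip().lower()
    gold = str(doc["answer"]).strip().lower()
    return {"exact_match": 1.0 if pred == gold else 0.0}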
The apply_filters() method runs filter ensembles on instances before scoring. The metrics.py module provides the full metric function and aggregation registry, including bootstrap standard error computation.
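A hedged sketch of how filtering, per-document scoring, and aggregation fit together on the evaluator side; task, docs, and model_outputs are placeholder names, the "exact_match" key assumes a generate_until task, and the real loop lives inside the evaluator.
from lmms_eval.api.metrics import mean, stderr_for_metric
task.apply_filters()  # run the configured filter ensembles before scoring
per_doc = [task.process_results(doc, results)["exact_match"]
           for doc, results in zip(docs, model_outputs)]
acc = mean(per_doc)  # aggregation function from the metrics registry
stderr_fn = stderr_for_metric(metric=mean, bootstrap_iters=100000)
acc_stderr = stderr_fn(per_doc) if stderr_fn else None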
Usage
Use process_results() when:
- The evaluator is computing per-document scores after model inference.
- You are defining a custom task and need to understand the scoring contract.
- You are debugging metric values by inspecting individual document scores.
- You need to add a new metric to the registry.
Code Reference
Source Location
- Repository: lmms-eval
- File:
lmms_eval/api/task.py (L1497-1649), lmms_eval/api/metrics.py (L1-951)
Signature
# Per-document scoring
class ConfigurableTask(Task):
    @retry(
        stop=(stop_after_attempt(5) | stop_after_delay(1200)),
        wait=wait_fixed(2),
    )
    def process_results(
        self,
        doc: dict,
        results: list,
        full_docs: Optional[dict] = None,
    ) -> dict: ...
# Filter application
class Task:
    def apply_filters(self) -> Optional[List[Instance]]: ...
# Metric aggregation with bootstrap stderr
def stderr_for_metric(
    metric: Callable,
    bootstrap_iters: int,
) -> Optional[Callable]: ...
def bootstrap_stderr(
    f: Callable,
    xs: list,
    iters: int,
) -> float: ...
Import
from lmms_eval.api.task import ConfigurableTask
from lmms_eval.api.metrics import bootstrap_stderr, stderr_for_metric
from lmms_eval.api.registry import register_metric, register_aggregation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | The original evaluation document containing question, answer, and metadata fields |
| results | list | Yes | Model output(s) -- list of strings for generation, or list of (logprob, is_greedy) tuples for loglikelihood |
| full_docs | Optional[dict] | No | Full dataset passed for tasks that need cross-document information during scoring |
| bootstrap_iters | int | No | Number of bootstrap iterations for stderr estimation during aggregation, not an argument to process_results (default: 100000) |
Outputs
| Name | Type | Description |
|---|---|---|
| result_dict | dict | Per-document dictionary mapping metric names to scores (e.g., {"exact_match": 1.0, "acc": 0.0}) |
| aggregated_metrics | dict | After aggregation: task-level metric values with stderr (e.g., {"acc": 0.75, "acc_stderr": 0.04}) |
Usage Examples
Basic Example
# process_results is called by the evaluator for each document
doc = {"question": "What color is the sky?", "answer": "blue"}
results = ["blue"]
# For a generate_until task
scores = task.process_results(doc, results)
print(scores)
# {"exact_match": 1.0}
Registering a Custom Metric
from lmms_eval.api.registry import register_metric, register_aggregation
# The built-in "mean" aggregation is already registered, so a custom
# aggregation needs a distinct name.
@register_aggregation("my_mean")
def my_mean(arr):
    return sum(arr) / len(arr)
@register_metric(
    metric="my_custom_metric",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="my_mean",
)
def my_custom_metric_fn(references, predictions, **kwargs):
    # Custom scoring logic: full credit if the reference appears in the prediction.
    ref = references[0].lower().strip()
    pred = predictions[0].lower().strip()
    return {"my_custom_metric": 1.0 if ref in pred else 0.0}
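Once registered, the metric can be referenced by name from a task configuration's metric list. A quick standalone sanity check of the function itself:
print(my_custom_metric_fn(references=["Paris"], predictions=["paris is the capital"]))
# {"my_custom_metric": 1.0}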
Bootstrap Standard Error
from lmms_eval.api.metrics import bootstrap_stderr, mean
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
stderr = bootstrap_stderr(mean, scores, iters=10000)
print(f"Mean: {mean(scores):.4f} +/- {stderr:.4f}")
# e.g., Mean: 0.6250 +/- 0.1768 (the stderr varies slightly across runs due to resampling)