Implementation: EvolvingLMMs-Lab lmms-eval Process Results
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for processing model outputs through filters and computing evaluation metrics with aggregation, provided by the lmms-eval framework.
Description
The ConfigurableTask.process_results() method is the per-document scoring function: it takes a single evaluation document and the model's output, and returns a dictionary mapping metric names to scores. It is decorated with a retry mechanism (@retry, stopping after 5 attempts or a 1200-second delay, with a 2-second fixed wait between attempts) to handle transient failures from external metric services such as LLM-as-judge APIs.
The method supports three main output types (a simplified scoring sketch follows this list):
- generate_until -- Strips whitespace from generated text, then applies each metric function from the task's _metric_fn_list. For tasks with multiple reference answers (multiple_target), it computes the max score across all references.
- multiple_choice -- Computes accuracy by selecting the choice with the highest log-likelihood (equivalently, the lowest loss). Supports accuracy, length-normalized accuracy (normalized by completion length), exact match, mutual-information scoring, F1, and MCC.
- loglikelihood -- Returns accuracy (whether the target is the model's greedy completion) and perplexity (derived from the returned log probability).
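For orientation, here is a minimal sketch of the scoring behavior described above. It is not the library's implementation; the metric-function call convention and the document field names ("answer", "gold") are assumptions for illustration.
# Simplified sketch of per-output-type scoring (not the library's code).
# Assumes each metric function returns a scalar score.
import numpy as np
def score_generate_until(doc, result, metric_fns, multiple_target=False):
    pred = result.strip()  # whitespace is stripped before scoring
    refs = doc["answer"] if multiple_target else [doc["answer"]]
    # With multiple reference answers, keep the best score across references.
    return {name: max(fn(references=[r], predictions=[pred]) for r in refs)
            for name, fn in metric_fns.items()}
def score_multiple_choice(doc, results):
    # results: one (logprob, is_greedy) tuple per answer choice.
    lls = np.array([ll for ll, _ in results])
    pred = int(np.argmax(lls))  # highest log-likelihood = most likely choice
    return {"acc": 1.0 if pred == doc["gold"] else 0.0}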
If the task's YAML configuration defines a custom process_results callable, that function is invoked instead of the built-in logic.
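As an illustration, a custom scorer can be a plain function in the task's utilities module. The module path, function name, and document field below are hypothetical, and the YAML hook in the comment follows the harness's usual !function convention.
# utils.py (hypothetical) -- wired into the task YAML with something like:
#   process_results: !function utils.my_process_results
def my_process_results(doc, results):
    # "answer" is an assumed document field; adapt to the dataset schema.
    pred = results[0].strip().lower()
    gold = str(doc["answer"]).strip().lower()
    return {"exact_match": 1.0 if pred == gold else 0.0}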
The apply_filters() method runs filter ensembles on instances before scoring. The metrics.py module provides the full metric function and aggregation registry, including bootstrap standard error computation.
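A hedged sketch of how filtering, per-document scoring, and aggregation fit together on the evaluator side; task, docs, and model_outputs are placeholder names, the "exact_match" key assumes a generate_until task, and the real loop lives inside the evaluator.
from lmms_eval.api.metrics import mean, stderr_for_metric
task.apply_filters()  # run the configured filter ensembles before scoring
per_doc = [task.process_results(doc, results)["exact_match"]
           for doc, results in zip(docs, model_outputs)]
acc = mean(per_doc)  # aggregation function from the metrics registry
stderr_fn = stderr_for_metric(metric=mean, bootstrap_iters=100000)
acc_stderr = stderr_fn(per_doc) if stderr_fn else None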
Usage
Use process_results() when:
- The evaluator is computing per-document scores after model inference.
- You are defining a custom task and need to understand the scoring contract.
- You are debugging metric values by inspecting individual document scores.
- You need to add a new metric to the registry.
Code Reference
Source Location
- Repository: lmms-eval
- File:
lmms_eval/api/task.py (L1497-1649), lmms_eval/api/metrics.py (L1-951)
Signature
# Per-document scoring
class ConfigurableTask(Task):
    @retry(
        stop=(stop_after_attempt(5) | stop_after_delay(1200)),
        wait=wait_fixed(2),
    )
    def process_results(
        self,
        doc: dict,
        results: list,
        full_docs: Optional[dict] = None,
    ) -> dict: ...
# Filter application
class Task:
    def apply_filters(self) -> Optional[List[Instance]]: ...
# Metric aggregation with bootstrap stderr
def stderr_for_metric(
    metric: Callable,
    bootstrap_iters: int,
) -> Optional[Callable]: ...
def bootstrap_stderr(
    f: Callable,
    xs: list,
    iters: int,
) -> float: ...
Import
from lmms_eval.api.task import ConfigurableTask
from lmms_eval.api.metrics import bootstrap_stderr, stderr_for_metric
from lmms_eval.api.registry import register_metric, register_aggregation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | The original evaluation document containing question, answer, and metadata fields |
| results | list | Yes | Model output(s) -- list of strings for generation, or list of (logprob, is_greedy) tuples for loglikelihood |
| full_docs | Optional[dict] | No | Full dataset passed for tasks that need cross-document information during scoring |
| bootstrap_iters | int | No | Number of bootstrap iterations for stderr estimation during aggregation, not an argument to process_results (default: 100000) |
Outputs
| Name | Type | Description |
|---|---|---|
| result_dict | dict | Per-document dictionary mapping metric names to scores (e.g., {"exact_match": 1.0, "acc": 0.0}) |
| aggregated_metrics | dict | After aggregation: task-level metric values with stderr (e.g., {"acc": 0.75, "acc_stderr": 0.04}) |
Usage Examples
Basic Example
# process_results is called by the evaluator for each document
doc = {"question": "What color is the sky?", "answer": "blue"}
results = ["blue"]
# For a generate_until task
scores = task.process_results(doc, results)
print(scores)
# {"exact_match": 1.0}
Registering a Custom Metric
from lmms_eval.api.registry import register_metric, register_aggregation
# The built-in "mean" aggregation is already registered, so a custom
# aggregation needs a distinct name.
@register_aggregation("my_mean")
def my_mean(arr):
    return sum(arr) / len(arr)
@register_metric(
    metric="my_custom_metric",
    higher_is_better=True,
    output_type="generate_until",
    aggregation="my_mean",
)
def my_custom_metric_fn(references, predictions, **kwargs):
    # Custom scoring logic: full credit if the reference appears in the prediction.
    ref = references[0].lower().strip()
    pred = predictions[0].lower().strip()
    return {"my_custom_metric": 1.0 if ref in pred else 0.0}
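Once registered, the metric can be referenced by name from a task configuration's metric list. A quick standalone sanity check of the function itself:
print(my_custom_metric_fn(references=["Paris"], predictions=["paris is the capital"]))
# {"my_custom_metric": 1.0}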
Bootstrap Standard Error
from lmms_eval.api.metrics import bootstrap_stderr, mean
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]
stderr = bootstrap_stderr(mean, scores, iters=10000)
print(f"Mean: {mean(scores):.4f} +/- {stderr:.4f}")
# e.g., Mean: 0.6250 +/- 0.1768 (the stderr varies slightly across runs due to resampling)