Principle: EvolvingLMMs-Lab lmms-eval Results Output
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Results output is the final stage of the evaluation pipeline where aggregated metrics and per-sample logs are persisted to disk, formatted for human consumption, and optionally uploaded to a remote hub for sharing and reproducibility.
Description
After post-processing and metric computation, the evaluation framework must persist results in a structured, machine-readable format while also producing human-readable summaries. Results output serves multiple audiences: developers debugging model quality, researchers comparing benchmarks, and automated systems consuming evaluation data.
The lmms-eval framework addresses this through two complementary systems:
EvaluationTracker -- A stateful logger that manages the lifecycle of evaluation metadata, aggregated results, and per-sample logs. It handles:
- Aggregated results -- A single JSON file containing task-level metrics, versions, configuration, timing information, and task hashes for reproducibility. The filename encodes the datetime of the run.
- Per-sample results -- JSONL files (one per task) containing individual document inputs, model responses, filtered responses, targets, and per-document metric scores.
- Hub upload -- Optional push to HuggingFace Hub datasets for sharing results publicly or privately, including metadata card generation.
make_table -- A utility function that renders the results dictionary into a formatted Markdown table suitable for terminal output and logging. The table automatically hides columns that contain only N/A values (e.g., stability metrics when num_samples=1).
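The two systems above can be illustrated with a minimal sketch of the two-file output pattern: one aggregated JSON plus one per-sample JSONL per task, with the run datetime encoded in the filenames. This is an illustrative sketch, not the actual EvaluationTracker API; the function name and directory sanitization rule are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def persist_results(output_path, model_name, results, samples_by_task):
    """Write one aggregated JSON and one per-task samples JSONL.

    Illustrative only -- the real EvaluationTracker manages more
    metadata (configs, hashes, timing) than shown here.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    out_dir = Path(output_path) / model_name.replace("/", "__")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Aggregated task-level metrics in a single datetime-stamped JSON file.
    agg_file = out_dir / f"{stamp}_results.json"
    agg_file.write_text(json.dumps(results, indent=2))

    # One JSONL file per task: one line per evaluated document.
    for task, samples in samples_by_task.items():
        with open(out_dir / f"{stamp}_samples_{task}.jsonl", "w") as f:
            for sample in samples:
                f.write(json.dumps(sample) + "\n")
    return agg_file
```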
Usage
Use results output whenever:
- You are completing an evaluation run and need to persist results for later analysis.
- You want to share evaluation results on HuggingFace Hub for public benchmarking.
- You need to inspect per-sample model outputs to diagnose failure modes.
- You are building a leaderboard or comparison dashboard from evaluation JSON files.
- You need a human-readable summary table of results for a report.
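For the failure-diagnosis use case, a hypothetical helper can scan a per-sample JSONL log and return the documents whose metric fell below a cutoff. The field names (doc_id, exact_match) are assumptions; the actual keys depend on the task's metric configuration.

```python
import json

def find_failures(jsonl_lines, metric="exact_match", cutoff=1.0):
    """Return doc_ids of samples whose metric score is below cutoff.

    Assumes each JSONL line carries a "doc_id" and a per-document
    metric score under `metric` -- adjust keys to the task at hand.
    """
    failures = []
    for line in jsonl_lines:
        sample = json.loads(line)
        if sample.get(metric, 0.0) < cutoff:
            failures.append(sample["doc_id"])
    return failures
```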
Theoretical Basis
The results output follows a structured logging pattern with these properties:
Reproducibility:
Each results file includes:
- Task configuration snapshots (the configs dictionary).
- Task hashes computed from per-sample content hashes: task_hash = sha256(concat(sample_hashes)).
- Git commit hash of the evaluation code.
- Start/end timestamps and total evaluation time.
- Model name and source information.
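The hash scheme above can be sketched as follows: each sample is hashed individually, and the task hash is the SHA-256 of the concatenated per-sample hashes. The exact canonicalization lmms-eval applies to sample content may differ; this shows the structure of the scheme.

```python
import hashlib

def sample_hash(sample_text: str) -> str:
    # Hash one sample's serialized content.
    return hashlib.sha256(sample_text.encode("utf-8")).hexdigest()

def task_hash(samples: list) -> str:
    # Concatenate the per-sample hex digests, then hash the result.
    # Order-sensitive: reordering samples changes the task hash.
    concatenated = "".join(sample_hash(s) for s in samples)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()
```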
Serialization Hierarchy:
output_path/
  model_name_sanitized/
    {datetime}_results.json          # Aggregated metrics
    {datetime}_samples_{task}.jsonl  # Per-sample details (one per task)
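The model_name_sanitized directory name implies a sanitization step that makes the model identifier filesystem-safe. A minimal sketch, assuming a simple character-replacement rule (the exact rules in lmms-eval may differ):

```python
import re

def sanitize_model_name(name: str) -> str:
    # Replace path separators, whitespace, and other characters that
    # are unsafe in directory names with a double underscore.
    return re.sub(r"[\"<>:/\\|?*\s]", "__", name)
```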
Table Rendering:
The make_table() function produces a Markdown table with columns:
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |

The two unnamed columns hold the metric direction indicator and the plus/minus separator between value and standard error.
Additional columns for CLT stderr, clustered stderr, stability metrics (EA, CA, IV, CR), and baseline comparison (Diff, CI, P_Value) are shown only when relevant data is present -- columns where all values are N/A are automatically hidden.
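The column-hiding behavior can be sketched as follows: build a Markdown table but drop any column whose values are all N/A. This mirrors the idea behind make_table, not its actual implementation; the function name and row format are assumptions.

```python
def render_table(headers, rows):
    """Render rows as a Markdown table, hiding all-N/A columns."""
    # Keep only columns where at least one row has a real value.
    keep = [i for i, _ in enumerate(headers)
            if any(row[i] != "N/A" for row in rows)]
    lines = ["|" + "|".join(headers[i] for i in keep) + "|",
             "|" + "|".join("---" for _ in keep) + "|"]
    for row in rows:
        lines.append("|" + "|".join(str(row[i]) for i in keep) + "|")
    return "\n".join(lines)
```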
Hub Integration:
When push is enabled, the tracker:
- Creates a HuggingFace dataset repository if it does not exist.
- Uploads the results JSON file with a path like model_name/{datetime}_results.json.
- For per-sample results, uploads the entire output folder.
- Generates a metadata card with dataset configs pointing to each task's sample files.
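The upload layout can be sketched as a dry run that computes the destination paths inside the dataset repository without performing any network calls. Real uploads would go through huggingface_hub (for example create_repo with exist_ok=True, then upload_file or upload_folder); the function name and file names here are illustrative.

```python
from pathlib import Path

def plan_upload(model_name, local_files):
    """Map local result files to their in-repo destination paths.

    Dry-run sketch only: returns (local_path, path_in_repo) pairs
    following the model_name/{filename} layout described above.
    """
    return [(str(f), f"{model_name}/{Path(f).name}") for f in local_files]
```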