Principle: EvolvingLMMs-Lab lmms-eval Results Output
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Results output is the final stage of the evaluation pipeline where aggregated metrics and per-sample logs are persisted to disk, formatted for human consumption, and optionally uploaded to a remote hub for sharing and reproducibility.
Description
After post-processing and metric computation, the evaluation framework must persist results in a structured, machine-readable format while also producing human-readable summaries. Results output serves multiple audiences: developers debugging model quality, researchers comparing benchmarks, and automated systems consuming evaluation data.
The lmms-eval framework addresses this through two complementary systems:
EvaluationTracker -- A stateful logger that manages the lifecycle of evaluation metadata, aggregated results, and per-sample logs. It handles:
- Aggregated results -- A single JSON file containing task-level metrics, versions, configuration, timing information, and task hashes for reproducibility. The filename encodes the datetime of the run.
- Per-sample results -- JSONL files (one per task) containing individual document inputs, model responses, filtered responses, targets, and per-document metric scores.
- Hub upload -- Optional push to HuggingFace Hub datasets for sharing results publicly or privately, including metadata card generation.
make_table -- A utility function that renders the results dictionary into a formatted Markdown table suitable for terminal output and logging. The table automatically hides columns that contain only N/A values (e.g., stability metrics when num_samples=1).
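The two systems above can be illustrated with a minimal sketch of the two-file output pattern: one aggregated JSON plus one per-sample JSONL per task, with the run datetime encoded in the filenames. This is an illustrative sketch, not the actual EvaluationTracker API; the function name and directory sanitization rule are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def persist_results(output_path, model_name, results, samples_by_task):
    """Write one aggregated JSON and one per-task samples JSONL.

    Illustrative only -- the real EvaluationTracker manages more
    metadata (configs, hashes, timing) than shown here.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S")
    out_dir = Path(output_path) / model_name.replace("/", "__")
    out_dir.mkdir(parents=True, exist_ok=True)

    # Aggregated task-level metrics in a single datetime-stamped JSON file.
    agg_file = out_dir / f"{stamp}_results.json"
    agg_file.write_text(json.dumps(results, indent=2))

    # One JSONL file per task: one line per evaluated document.
    for task, samples in samples_by_task.items():
        with open(out_dir / f"{stamp}_samples_{task}.jsonl", "w") as f:
            for sample in samples:
                f.write(json.dumps(sample) + "\n")
    return agg_file
```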
Usage
Use results output whenever:
- You are completing an evaluation run and need to persist results for later analysis.
- You want to share evaluation results on HuggingFace Hub for public benchmarking.
- You need to inspect per-sample model outputs to diagnose failure modes.
- You are building a leaderboard or comparison dashboard from evaluation JSON files.
- You need a human-readable summary table of results for a report.
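For the failure-diagnosis use case, a hypothetical helper can scan a per-sample JSONL log and return the documents whose metric fell below a cutoff. The field names (doc_id, exact_match) are assumptions; the actual keys depend on the task's metric configuration.

```python
import json

def find_failures(jsonl_lines, metric="exact_match", cutoff=1.0):
    """Return doc_ids of samples whose metric score is below cutoff.

    Assumes each JSONL line carries a "doc_id" and a per-document
    metric score under `metric` -- adjust keys to the task at hand.
    """
    failures = []
    for line in jsonl_lines:
        sample = json.loads(line)
        if sample.get(metric, 0.0) < cutoff:
            failures.append(sample["doc_id"])
    return failures
```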
Theoretical Basis
The results output follows a structured logging pattern with these properties:
Reproducibility:
Each results file includes:
- Task configuration snapshots (the configs dictionary).
- Task hashes computed from per-sample content hashes: task_hash = sha256(concat(sample_hashes)).
- Git commit hash of the evaluation code.
- Start/end timestamps and total evaluation time.
- Model name and source information.
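The hash scheme above can be sketched as follows: each sample is hashed individually, and the task hash is the SHA-256 of the concatenated per-sample hashes. The exact canonicalization lmms-eval applies to sample content may differ; this shows the structure of the scheme.

```python
import hashlib

def sample_hash(sample_text: str) -> str:
    # Hash one sample's serialized content.
    return hashlib.sha256(sample_text.encode("utf-8")).hexdigest()

def task_hash(samples: list) -> str:
    # Concatenate the per-sample hex digests, then hash the result.
    # Order-sensitive: reordering samples changes the task hash.
    concatenated = "".join(sample_hash(s) for s in samples)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()
```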
Serialization Hierarchy:
output_path/
  model_name_sanitized/
    {datetime}_results.json          # Aggregated metrics
    {datetime}_samples_{task}.jsonl  # Per-sample details (one per task)
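The model_name_sanitized directory name implies a sanitization step that makes the model identifier filesystem-safe. A minimal sketch, assuming a simple character-replacement rule (the exact rules in lmms-eval may differ):

```python
import re

def sanitize_model_name(name: str) -> str:
    # Replace path separators, whitespace, and other characters that
    # are unsafe in directory names with a double underscore.
    return re.sub(r"[\"<>:/\\|?*\s]", "__", name)
```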
Table Rendering:
The make_table() function produces a Markdown table with columns:
| Tasks | Version | Filter | n-shot | Metric |   | Value |   | Stderr |

The two unnamed columns hold the metric direction indicator and the plus/minus separator between value and standard error.
Additional columns for CLT stderr, clustered stderr, stability metrics (EA, CA, IV, CR), and baseline comparison (Diff, CI, P_Value) are shown only when relevant data is present -- columns where all values are N/A are automatically hidden.
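The column-hiding behavior can be sketched as follows: build a Markdown table but drop any column whose values are all N/A. This mirrors the idea behind make_table, not its actual implementation; the function name and row format are assumptions.

```python
def render_table(headers, rows):
    """Render rows as a Markdown table, hiding all-N/A columns."""
    # Keep only columns where at least one row has a real value.
    keep = [i for i, _ in enumerate(headers)
            if any(row[i] != "N/A" for row in rows)]
    lines = ["|" + "|".join(headers[i] for i in keep) + "|",
             "|" + "|".join("---" for _ in keep) + "|"]
    for row in rows:
        lines.append("|" + "|".join(str(row[i]) for i in keep) + "|")
    return "\n".join(lines)
```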
Hub Integration:
When push is enabled, the tracker:
- Creates a HuggingFace dataset repository if it does not exist.
- Uploads the results JSON file with a path like model_name/{datetime}_results.json.
- For per-sample results, uploads the entire output folder.
- Generates a metadata card with dataset configs pointing to each task's sample files.
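The upload layout can be sketched as a dry run that computes the destination paths inside the dataset repository without performing any network calls. Real uploads would go through huggingface_hub (for example create_repo with exist_ok=True, then upload_file or upload_folder); the function name and file names here are illustrative.

```python
from pathlib import Path

def plan_upload(model_name, local_files):
    """Map local result files to their in-repo destination paths.

    Dry-run sketch only: returns (local_path, path_in_repo) pairs
    following the model_name/{filename} layout described above.
    """
    return [(str(f), f"{model_name}/{Path(f).name}") for f in local_files]
```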