Principle:EvolvingLMMs Lab Lmms eval Results Output

From Leeroopedia
Domains: Evaluation, Logging
Last Updated: 2026-02-14 00:00 GMT

Overview

Results output is the final stage of the evaluation pipeline where aggregated metrics and per-sample logs are persisted to disk, formatted for human consumption, and optionally uploaded to a remote hub for sharing and reproducibility.

Description

After post-processing and metric computation, the evaluation framework must persist results in a structured, machine-readable format while also producing human-readable summaries. Results output serves multiple audiences: developers debugging model quality, researchers comparing benchmarks, and automated systems consuming evaluation data.

The lmms-eval framework addresses this through two complementary systems:

EvaluationTracker -- A stateful logger that manages the lifecycle of evaluation metadata, aggregated results, and per-sample logs. It handles:

  • Aggregated results -- A single JSON file containing task-level metrics, versions, configuration, timing information, and task hashes for reproducibility. The filename encodes the datetime of the run.
  • Per-sample results -- JSONL files (one per task) containing individual document inputs, model responses, filtered responses, targets, and per-document metric scores.
  • Hub upload -- Optional push to HuggingFace Hub datasets for sharing results publicly or privately, including metadata card generation.

make_table -- A utility function that renders the results dictionary into a formatted Markdown table suitable for terminal output and logging. The table automatically hides columns that contain only N/A values (e.g., stability metrics when num_samples=1).
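Taken together, a run typically constructs the tracker up front, lets the evaluator fill in the results dictionary, then persists and prints. The sketch below assumes an interface like lm-evaluation-harness's EvaluationTracker, from which lmms-eval derives; the import paths and method names are assumptions, not verbatim lmms-eval API.

```python
# Wiring sketch of the two output systems; import paths and method
# names are ASSUMED to mirror lm-evaluation-harness, from which
# lmms-eval is derived -- check the installed version before use.
from lmms_eval.loggers import EvaluationTracker  # assumed path
from lmms_eval.utils import make_table           # assumed path

tracker = EvaluationTracker(output_path="./eval_results")

# `results` is the evaluator's output dict: aggregated metrics plus
# configs, versions, timing, task hashes, and per-sample logs.
results = ...  # produced by the evaluation entry point

# Persist {datetime}_results.json and one samples JSONL per task.
tracker.save_results_aggregated(results=results, samples=results["samples"])
for task_name, samples in results["samples"].items():
    tracker.save_results_samples(task_name=task_name, samples=samples)

# Render the human-readable Markdown summary.
print(make_table(results))
```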

Usage

Use results output whenever:

  • You are completing an evaluation run and need to persist results for later analysis.
  • You want to share evaluation results on HuggingFace Hub for public benchmarking.
  • You need to inspect per-sample model outputs to diagnose failure modes.
  • You are building a leaderboard or comparison dashboard from evaluation JSON files (see the loading sketch after this list).
  • You need a human-readable summary table of results for a report.
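For the dashboard and failure-diagnosis cases, the persisted files can be consumed with only the standard library. A minimal sketch, assuming the file layout described under Theoretical Basis below and lm-evaluation-harness-style field names in the sample records:

```python
import json
from pathlib import Path

run_dir = Path("./eval_results/model_name_sanitized")  # layout from below

# Aggregated metrics: pick the latest datetime-prefixed results file.
results_file = sorted(run_dir.glob("*_results.json"))[-1]
results = json.loads(results_file.read_text())
for task, metrics in results["results"].items():
    print(task, metrics)

# Per-sample logs: one JSONL per task; scan for failing samples.
# Field names ("doc", "filtered_resps") and the metric key follow
# lm-evaluation-harness conventions and are an assumption here.
for samples_file in run_dir.glob("*_samples_*.jsonl"):
    with samples_file.open() as f:
        for line in f:
            sample = json.loads(line)
            if sample.get("exact_match") == 0:  # assumed metric key
                print(sample.get("doc"), sample.get("filtered_resps"))
```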

Theoretical Basis

The results output follows a structured logging pattern with these properties:

Reproducibility:

Each results file includes:

  • Task configuration snapshots (configs dictionary).
  • Task hashes computed from per-sample content hashes: task_hash = sha256(concat(sample_hashes)) (see the hashing sketch after this list).
  • Git commit hash of the evaluation code.
  • Start/end timestamps and total evaluation time.
  • Model name and source information.
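The task-hash formula above reduces to a few lines of hashlib. A sketch of the stated construction; how lmms-eval serializes an individual sample before hashing is an implementation detail, so sample_hash here is only illustrative:

```python
import hashlib

def sample_hash(sample: dict) -> str:
    """Hash one per-sample record (serialization scheme assumed)."""
    blob = repr(sorted(sample.items())).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

def task_hash(samples: list[dict]) -> str:
    """task_hash = sha256(concat(sample_hashes)), as stated above."""
    concatenated = "".join(sample_hash(s) for s in samples)
    return hashlib.sha256(concatenated.encode("utf-8")).hexdigest()

print(task_hash([{"doc_id": 0, "target": "yes"}, {"doc_id": 1, "target": "no"}]))
```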

Serialization Hierarchy:

output_path/
  model_name_sanitized/
    {datetime}_results.json       # Aggregated metrics
    {datetime}_samples_{task}.jsonl  # Per-sample details (one per task)
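This layout is straightforward to reproduce with the standard library. In the sketch below, the sanitizer's character set and the timestamp format are assumptions modeled on lm-evaluation-harness, not the exact lmms-eval rules:

```python
import re
from datetime import datetime
from pathlib import Path

def sanitize(model_name: str) -> str:
    # Replace characters that are illegal or ambiguous in file paths
    # (the exact character set is an assumption).
    return re.sub(r"[\"<>:/\\|?*\[\]]+", "__", model_name)

output_path = Path("./eval_results")
run_dir = output_path / sanitize("liuhaotian/llava-v1.5-7b")
run_dir.mkdir(parents=True, exist_ok=True)

stamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")  # assumed format
results_file = run_dir / f"{stamp}_results.json"
samples_file = run_dir / f"{stamp}_samples_mme.jsonl"  # one per task
print(results_file, samples_file, sep="\n")
```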

Table Rendering:

The make_table() function produces a Markdown table with columns:

| Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |

The two unlabeled columns are presentation glue: in lm-evaluation-harness-style tables they typically hold the metric's direction indicator after Metric and the "±" separator between Value and Stderr.

Additional columns for CLT stderr, clustered stderr, stability metrics (EA, CA, IV, CR), and baseline comparison (Diff, CI, P_Value) are shown only when relevant data is present -- columns where all values are N/A are automatically hidden.
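The hiding rule is easy to state precisely: a column is rendered only if at least one row holds a non-N/A value. A generic sketch (with illustrative numbers, not the lmms-eval source):

```python
def visible_columns(headers: list[str], rows: list[list[str]]) -> list[int]:
    """Indices of columns where at least one row holds a non-N/A value."""
    return [
        i for i in range(len(headers))
        if any(row[i] != "N/A" for row in rows)
    ]

headers = ["Tasks", "Metric", "Value", "Stderr", "Diff", "P_Value"]
rows = [
    ["mme", "exact_match", "0.712", "0.004", "N/A", "N/A"],
    ["mmmu_val", "acc", "0.389", "0.006", "N/A", "N/A"],
]
keep = visible_columns(headers, rows)
print("| " + " | ".join(headers[i] for i in keep) + " |")
for row in rows:
    print("| " + " | ".join(row[i] for i in keep) + " |")
```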

Hub Integration:

When push is enabled, the tracker performs four steps (a client-level sketch follows the list):

  1. Creates a HuggingFace dataset repository if it does not exist.
  2. Uploads the results JSON file with a path like model_name/{datetime}_results.json.
  3. For per-sample results, uploads the entire output folder.
  4. Generates a metadata card with dataset configs pointing to each task's sample files.
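These steps map directly onto the public huggingface_hub client. The sketch below uses real HfApi calls, but the repository name, file paths, and privacy flag are placeholders, and the metadata-card generation of step 4 is omitted:

```python
from huggingface_hub import HfApi

api = HfApi()
repo_id = "my-org/lmms-eval-results"  # placeholder repository

# Step 1: create the dataset repo if it does not exist.
api.create_repo(repo_id=repo_id, repo_type="dataset", private=True, exist_ok=True)

# Step 2: upload the aggregated JSON as model_name/{datetime}_results.json.
api.upload_file(
    path_or_fileobj="eval_results/model_name_sanitized/2026-02-14T00-00-00_results.json",
    path_in_repo="model_name/2026-02-14T00-00-00_results.json",
    repo_id=repo_id,
    repo_type="dataset",
)

# Step 3: for per-sample results, upload the entire output folder.
api.upload_folder(
    folder_path="eval_results/model_name_sanitized",
    path_in_repo="model_name",
    repo_id=repo_id,
    repo_type="dataset",
)
```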
