
Implementation:EvolvingLMMs Lab Lmms eval EvaluationTracker Save

From Leeroopedia
Domains Evaluation, Logging
Last Updated 2026-02-14 00:00 GMT

Overview

Concrete tool for persisting evaluation results to disk and optionally uploading them to HuggingFace Hub, provided by the lmms-eval framework.

Description

The EvaluationTracker class manages the complete results persistence lifecycle. It is initialized with an output path and optional hub configuration, then used throughout the evaluation to log experiment parameters, save aggregated results, save per-sample logs, and optionally push everything to HuggingFace Hub.

The save_results_aggregated() method writes a JSON file containing aggregated metrics, task hashes, and evaluation metadata (model name, timing, system instruction). The filename is timestamped to support multiple runs without overwriting.

The save_results_samples() method writes per-task JSONL files where each line is a JSON object containing the input, model response, filtered response, target, and per-document metrics. The arguments and doc fields are removed from samples to reduce file size and avoid serializing large binary data (images, audio).
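Because each line of a samples file is a standalone JSON object, consumers can parse it with the standard library alone. The record below is an illustrative sketch of the fields described above (input, responses, filtered response, target, metrics); the exact field names in lmms-eval's output are an assumption here, not a guaranteed schema:

```python
import json

# Hypothetical per-sample record; field names are illustrative, not
# a guaranteed schema of lmms-eval's JSONL output.
line = json.dumps({
    "doc_id": 0,
    "target": "B",
    "resps": [["The answer is B."]],
    "filtered_resps": ["B"],
    "acc": 1.0,
})

# One JSON object per line: parse each line independently.
record = json.loads(line)
print(record["filtered_resps"][0])
```

Note that `arguments` and `doc` are absent from such records by design, since they may contain large binary payloads.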

The make_table() utility function in lmms_eval/utils.py renders the results dictionary into a Markdown table with intelligent column hiding for optional metrics. It is used for terminal output during evaluation.

Usage

Use EvaluationTracker when:

  • You are running an evaluation and want to persist results to disk.
  • You want to upload results to HuggingFace Hub for sharing.
  • You need to format results as a readable table for terminal output.
  • You are building tooling that consumes evaluation JSON/JSONL outputs.

Code Reference

Source Location

  • Repository: lmms-eval
  • File: lmms_eval/loggers/evaluation_tracker.py (L169-321), lmms_eval/utils.py (L528-678)

Signature

class EvaluationTracker:
    def __init__(
        self,
        output_path: str = None,
        hub_results_org: str = "",
        hub_repo_name: str = "",
        details_repo_name: str = "",
        results_repo_name: str = "",
        push_results_to_hub: bool = False,
        push_samples_to_hub: bool = False,
        public_repo: bool = False,
        token: str = "",
        leaderboard_url: str = "",
        point_of_contact: str = "",
        gated: bool = False,
    ) -> None: ...

    def save_results_aggregated(
        self,
        results: dict,
        samples: dict,
        datetime_str: str,
    ) -> None:
        """Save aggregated results JSON and optionally push
        to HuggingFace Hub."""
        ...

    def save_results_samples(
        self,
        task_name: str,
        samples: dict,
    ) -> None:
        """Save per-task sample JSONL and optionally push
        to HuggingFace Hub."""
        ...

def make_table(
    result_dict: dict,
    column: str = "results",
    sort_results: bool = False,
) -> str:
    """Generate Markdown table of evaluation results."""
    ...

Import

from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
from lmms_eval.utils import make_table

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| output_path | str | No | Directory path for saving result files. If None, results are not saved to disk. |
| results | dict | Yes | Aggregated results dictionary containing results, versions, configs, n-shot, and higher_is_better keys |
| samples | dict | Yes | Dictionary mapping task names to lists of per-sample result dicts |
| datetime_str | str | Yes | Datetime string for unique file naming (e.g., "20260214_120000") |
| push_results_to_hub | bool | No | Whether to upload aggregated results to HuggingFace Hub (default: False) |
| push_samples_to_hub | bool | No | Whether to upload per-sample results to HuggingFace Hub (default: False) |
| hub_results_org | str | No | HuggingFace organization for the results dataset repository |
| token | str | No | HuggingFace API token with write access (required if pushing to hub) |
| result_dict | dict | Yes (make_table) | Results dictionary to render as a table |
| column | str | No | Which key to use for table rows: "results" or "groups" (default: "results") |
| sort_results | bool | No | Whether to sort table rows alphabetically (default: False) |
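A datetime_str matching the "20260214_120000" pattern can be produced with strftime; the exact format pattern is an inference from the example above, not taken from the library's source:

```python
from datetime import datetime

# Format a timestamp in the "YYYYMMDD_HHMMSS" shape used for file naming.
dt = datetime(2026, 2, 14, 12, 0, 0)
datetime_str = dt.strftime("%Y%m%d_%H%M%S")
print(datetime_str)  # → 20260214_120000
```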

Outputs

| Name | Type | Description |
|------|------|-------------|
| {datetime}_results.json | JSON file | Aggregated metrics, configs, task hashes, and evaluation metadata for all tasks |
| {datetime}_samples_{task}.jsonl | JSONL file | Per-document results for a specific task, one JSON object per line |
| Hub upload | HF Dataset | Optional upload of results and samples to a HuggingFace Hub dataset repository |
| Markdown table | str | Formatted table string returned by make_table() for terminal display |
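Tooling that consumes these outputs can locate them by the filename patterns above. This sketch creates placeholder files in a temporary directory to demonstrate the matching logic; the filenames are constructed from the table's patterns, not read from a real run:

```python
import json
import tempfile
from pathlib import Path

out = Path(tempfile.mkdtemp())

# Simulate the files the tracker would write for one run.
(out / "20260214_120000_results.json").write_text(json.dumps({"results": {}}))
(out / "20260214_120000_samples_mmmu_val.jsonl").write_text("{}\n")

# Aggregated results: one timestamped JSON file per run.
results_files = sorted(out.glob("*_results.json"))
# Per-sample logs: one JSONL file per task per run.
sample_files = sorted(out.glob("*_samples_*.jsonl"))

print([p.name for p in results_files])
print([p.name for p in sample_files])
```

The two glob patterns do not overlap, so a consumer can process aggregated and per-sample files independently.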

Usage Examples

Basic Example

from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
from lmms_eval.utils import make_table

# Initialize tracker with local output
tracker = EvaluationTracker(output_path="./eval_results")

# Log experiment parameters
tracker.general_config_tracker.log_experiment_args(
    model_source="hf",
    model_args="pretrained=Qwen/Qwen2.5-VL-3B-Instruct",
    system_instruction=None,
    chat_template=None,
    fewshot_as_multiturn=False,
)

# After evaluation completes, save aggregated results
tracker.save_results_aggregated(
    results=results_dict,
    samples=samples_dict,
    datetime_str="20260214_120000",
)

# Save per-task sample logs
for task_name, task_samples in samples_dict.items():
    tracker.save_results_samples(task_name, task_samples)

# Generate a human-readable table
table = make_table(results_dict)
print(table)

Push to HuggingFace Hub

from lmms_eval.loggers.evaluation_tracker import EvaluationTracker

tracker = EvaluationTracker(
    output_path="./eval_results",
    hub_results_org="my-org",
    results_repo_name="lmms-eval-results",
    details_repo_name="lmms-eval-details",
    push_results_to_hub=True,
    push_samples_to_hub=True,
    public_repo=True,
    token="hf_xxxxxxxxxxxxx",
)

# Results will be uploaded after save_results_aggregated
# and save_results_samples are called

make_table Output Format

from lmms_eval.utils import make_table

# Example result_dict structure
result_dict = {
    "results": {
        "mmmu_val": {
            "acc,none": 0.4567,
            "acc_stderr,none": 0.0234,
        },
        "mme": {
            "score,none": 1823.5,
            "score_stderr,none": 45.2,
        },
    },
    "versions": {"mmmu_val": "Yaml", "mme": "Yaml"},
    "n-shot": {"mmmu_val": 0, "mme": 0},
    "higher_is_better": {
        "mmmu_val": {"acc": True},
        "mme": {"score": True},
    },
}

table = make_table(result_dict)
print(table)
# | Tasks    | Version | Filter | n-shot | Metric | | Value  | | Stderr |
# |----------|---------|--------|--------|--------|-|--------|-|--------|
# | mme      | Yaml    | none   | 0      | score  |↑| 1823.5 |±| 45.2   |
# | mmmu_val | Yaml    | none   | 0      | acc    |↑| 0.4567 |±| 0.0234 |

Related Pages

Implements Principle
