Implementation: EvolvingLMMs-Lab lmms-eval EvaluationTracker Save
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for persisting evaluation results to disk and optionally uploading to HuggingFace Hub, provided by the lmms-eval framework.
Description
The EvaluationTracker class manages the complete results persistence lifecycle. It is initialized with an output path and optional hub configuration, then used throughout the evaluation to log experiment parameters, save aggregated results, save per-sample logs, and optionally push everything to HuggingFace Hub.
The save_results_aggregated() method writes a JSON file containing aggregated metrics, task hashes, and evaluation metadata (model name, timing, system instruction). The filename is timestamped to support multiple runs without overwriting.
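A minimal sketch of consuming the aggregated file from another script. It assumes the "{datetime}_results.json" files described above can be found under output_path; the exact nesting (e.g., a model-specific subdirectory) is not guaranteed here, so a recursive glob is used:
import json
from pathlib import Path

output_dir = Path("./eval_results")
# Collect timestamped aggregated results files; timestamps sort chronologically
result_files = sorted(output_dir.rglob("*results.json"))
latest = result_files[-1]
with latest.open() as f:
    aggregated = json.load(f)
# Per-task metrics live under the "results" key
for task, metrics in aggregated["results"].items():
    print(task, metrics)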
The save_results_samples() method writes per-task JSONL files where each line is a JSON object containing the input, model response, filtered response, target, and per-document metrics. The arguments and doc fields are removed from samples to reduce file size and avoid serializing large binary data (images, audio).
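A short sketch of reading one of these per-task JSONL files back, e.g., to recompute an average metric. The file path and the metric key "acc" are placeholders; actual filenames follow the "{datetime}_samples_{task}.jsonl" pattern and metric keys depend on the task configuration:
import json
from pathlib import Path

samples_path = Path("./eval_results/20260214_120000_samples_mmmu_val.jsonl")
records = [json.loads(line) for line in samples_path.open()]
# Each record holds the filtered response, target, and per-document metrics
accs = [r["acc"] for r in records if "acc" in r]
if accs:
    print(f"mean acc over {len(accs)} samples: {sum(accs) / len(accs):.4f}")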
The make_table() utility function in lmms_eval/utils.py renders the results dictionary into a Markdown table with intelligent column hiding for optional metrics. It is used for terminal output during evaluation.
Usage
Use EvaluationTracker when:
- You are running an evaluation and want to persist results to disk.
- You want to upload results to HuggingFace Hub for sharing.
- You need to format results as a readable table for terminal output.
- You are building tooling that consumes evaluation JSON/JSONL outputs.
Code Reference
Source Location
- Repository: lmms-eval
- Files: lmms_eval/loggers/evaluation_tracker.py (L169-321), lmms_eval/utils.py (L528-678)
Signature
class EvaluationTracker:
    def __init__(
        self,
        output_path: str = None,
        hub_results_org: str = "",
        hub_repo_name: str = "",
        details_repo_name: str = "",
        results_repo_name: str = "",
        push_results_to_hub: bool = False,
        push_samples_to_hub: bool = False,
        public_repo: bool = False,
        token: str = "",
        leaderboard_url: str = "",
        point_of_contact: str = "",
        gated: bool = False,
    ) -> None: ...

    def save_results_aggregated(
        self,
        results: dict,
        samples: dict,
        datetime_str: str,
    ) -> None:
        """Save aggregated results JSON and optionally push
        to HuggingFace Hub."""
        ...

    def save_results_samples(
        self,
        task_name: str,
        samples: dict,
    ) -> None:
        """Save per-task sample JSONL and optionally push
        to HuggingFace Hub."""
        ...

# Module-level function in lmms_eval/utils.py
def make_table(
    result_dict: dict,
    column: str = "results",
    sort_results: bool = False,
) -> str:
    """Generate Markdown table of evaluation results."""
    ...
Import
from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
from lmms_eval.utils import make_table
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_path | str | No | Directory path for saving results files. If None, results are not saved to disk. |
| results | dict | Yes | Aggregated results dictionary containing results, versions, configs, n-shot, and higher_is_better keys |
| samples | dict | Yes | Dictionary mapping task names to lists of per-sample result dicts |
| datetime_str | str | Yes | Datetime string for unique file naming (e.g., "20260214_120000") |
| push_results_to_hub | bool | No | Whether to upload aggregated results to HuggingFace Hub (default: False) |
| push_samples_to_hub | bool | No | Whether to upload per-sample results to HuggingFace Hub (default: False) |
| hub_results_org | str | No | HuggingFace organization for the results dataset repository |
| token | str | No | HuggingFace API token with write access (required if pushing to hub) |
| result_dict | dict | Yes (for make_table) | Results dictionary to render as a table |
| column | str | No | Which key to use for table rows: "results" or "groups" (default: "results") |
| sort_results | bool | No | Whether to sort table rows alphabetically (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| {datetime}_results.json | JSON file | Aggregated metrics, configs, task hashes, and evaluation metadata for all tasks |
| {datetime}_samples_{task}.jsonl | JSONL file | Per-document results for a specific task, one JSON object per line |
| Hub upload | HF Dataset | Optional upload of results and samples to a HuggingFace Hub dataset repository |
| Markdown table | str | Formatted table string returned by make_table() for terminal display |
Usage Examples
Basic Example
from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
from lmms_eval.utils import make_table
# Initialize tracker with local output
tracker = EvaluationTracker(output_path="./eval_results")
# Log experiment parameters
tracker.general_config_tracker.log_experiment_args(
    model_source="hf",
    model_args="pretrained=Qwen/Qwen2.5-VL-3B-Instruct",
    system_instruction=None,
    chat_template=None,
    fewshot_as_multiturn=False,
)
# After evaluation completes, save aggregated results
tracker.save_results_aggregated(
    results=results_dict,
    samples=samples_dict,
    datetime_str="20260214_120000",
)
# Save per-task sample logs
for task_name, task_samples in samples_dict.items():
    tracker.save_results_samples(task_name, task_samples)
# Generate a human-readable table
table = make_table(results_dict)
print(table)
Push to HuggingFace Hub
from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
tracker = EvaluationTracker(
    output_path="./eval_results",
    hub_results_org="my-org",
    results_repo_name="lmms-eval-results",
    details_repo_name="lmms-eval-details",
    push_results_to_hub=True,
    push_samples_to_hub=True,
    public_repo=True,
    token="hf_xxxxxxxxxxxxx",
)
# Results will be uploaded after save_results_aggregated
# and save_results_samples are called
make_table Output Format
from lmms_eval.utils import make_table
# Example result_dict structure
result_dict = {
    "results": {
        "mmmu_val": {
            "acc,none": 0.4567,
            "acc_stderr,none": 0.0234,
        },
        "mme": {
            "score,none": 1823.5,
            "score_stderr,none": 45.2,
        },
    },
    "versions": {"mmmu_val": "Yaml", "mme": "Yaml"},
    "n-shot": {"mmmu_val": 0, "mme": 0},
    "higher_is_better": {
        "mmmu_val": {"acc": True},
        "mme": {"score": True},
    },
}
table = make_table(result_dict)
print(table)
# | Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
# |----------|---------|--------|--------|--------|-|--------|-|--------|
# | mme | Yaml | none | 0 | score |↑| 1823.5 |±| 45.2 |
# | mmmu_val | Yaml | none | 0 | acc |↑| 0.4567 |±| 0.0234 |
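When tasks are organized into groups, the same function can render group-level rows instead. A hedged sketch, assuming result_dict also carries a "groups" key with the same metric layout as "results":
# Render group-level aggregates, with rows sorted alphabetically
group_table = make_table(result_dict, column="groups", sort_results=True)
print(group_table)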