Implementation: EvolvingLMMs-Lab lmms-eval EvaluationTracker Save
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Logging |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Concrete tool for persisting evaluation results to disk and optionally uploading to HuggingFace Hub, provided by the lmms-eval framework.
Description
The EvaluationTracker class manages the complete results persistence lifecycle. It is initialized with an output path and optional hub configuration, then used throughout the evaluation to log experiment parameters, save aggregated results, save per-sample logs, and optionally push everything to HuggingFace Hub.
The save_results_aggregated() method writes a JSON file containing aggregated metrics, task hashes, and evaluation metadata (model name, timing, system instruction). The filename is timestamped to support multiple runs without overwriting.
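A minimal sketch of consuming the aggregated file from another script. It assumes the "{datetime}_results.json" files described above can be found under output_path; the exact nesting (e.g., a model-specific subdirectory) is not guaranteed here, so a recursive glob is used:
import json
from pathlib import Path

output_dir = Path("./eval_results")
# Collect timestamped aggregated results files; timestamps sort chronologically
result_files = sorted(output_dir.rglob("*results.json"))
latest = result_files[-1]
with latest.open() as f:
    aggregated = json.load(f)
# Per-task metrics live under the "results" key
for task, metrics in aggregated["results"].items():
    print(task, metrics)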
The save_results_samples() method writes per-task JSONL files where each line is a JSON object containing the input, model response, filtered response, target, and per-document metrics. The arguments and doc fields are removed from samples to reduce file size and avoid serializing large binary data (images, audio).
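A short sketch of reading one of these per-task JSONL files back, e.g., to recompute an average metric. The file path and the metric key "acc" are placeholders; actual filenames follow the "{datetime}_samples_{task}.jsonl" pattern and metric keys depend on the task configuration:
import json
from pathlib import Path

samples_path = Path("./eval_results/20260214_120000_samples_mmmu_val.jsonl")
records = [json.loads(line) for line in samples_path.open()]
# Each record holds the filtered response, target, and per-document metrics
accs = [r["acc"] for r in records if "acc" in r]
if accs:
    print(f"mean acc over {len(accs)} samples: {sum(accs) / len(accs):.4f}")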
The make_table() utility function in lmms_eval/utils.py renders the results dictionary into a Markdown table with intelligent column hiding for optional metrics. It is used for terminal output during evaluation.
Usage
Use EvaluationTracker when:
- You are running an evaluation and want to persist results to disk.
- You want to upload results to HuggingFace Hub for sharing.
- You need to format results as a readable table for terminal output.
- You are building tooling that consumes evaluation JSON/JSONL outputs.
Code Reference
Source Location
- Repository: lmms-eval
- Files: lmms_eval/loggers/evaluation_tracker.py (L169-321), lmms_eval/utils.py (L528-678)
Signature
class EvaluationTracker:
    def __init__(
        self,
        output_path: str = None,
        hub_results_org: str = "",
        hub_repo_name: str = "",
        details_repo_name: str = "",
        results_repo_name: str = "",
        push_results_to_hub: bool = False,
        push_samples_to_hub: bool = False,
        public_repo: bool = False,
        token: str = "",
        leaderboard_url: str = "",
        point_of_contact: str = "",
        gated: bool = False,
    ) -> None: ...

    def save_results_aggregated(
        self,
        results: dict,
        samples: dict,
        datetime_str: str,
    ) -> None:
        """Save aggregated results JSON and optionally push
        to HuggingFace Hub."""
        ...

    def save_results_samples(
        self,
        task_name: str,
        samples: dict,
    ) -> None:
        """Save per-task sample JSONL and optionally push
        to HuggingFace Hub."""
        ...

# Module-level function in lmms_eval/utils.py
def make_table(
    result_dict: dict,
    column: str = "results",
    sort_results: bool = False,
) -> str:
    """Generate Markdown table of evaluation results."""
    ...
Import
from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
from lmms_eval.utils import make_table
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_path | str | No | Directory path for saving results files. If None, results are not saved to disk. |
| results | dict | Yes | Aggregated results dictionary containing results, versions, configs, n-shot, and higher_is_better keys |
| samples | dict | Yes | Dictionary mapping task names to lists of per-sample result dicts |
| datetime_str | str | Yes | Datetime string for unique file naming (e.g., "20260214_120000") |
| push_results_to_hub | bool | No | Whether to upload aggregated results to HuggingFace Hub (default: False) |
| push_samples_to_hub | bool | No | Whether to upload per-sample results to HuggingFace Hub (default: False) |
| hub_results_org | str | No | HuggingFace organization for the results dataset repository |
| token | str | No | HuggingFace API token with write access (required if pushing to hub) |
| result_dict | dict | Yes (for make_table) | Results dictionary to render as a table |
| column | str | No | Which key to use for table rows: "results" or "groups" (default: "results") |
| sort_results | bool | No | Whether to sort table rows alphabetically (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| {datetime}_results.json | JSON file | Aggregated metrics, configs, task hashes, and evaluation metadata for all tasks |
| {datetime}_samples_{task}.jsonl | JSONL file | Per-document results for a specific task, one JSON object per line |
| Hub upload | HF Dataset | Optional upload of results and samples to a HuggingFace Hub dataset repository |
| Markdown table | str | Formatted table string returned by make_table() for terminal display |
Usage Examples
Basic Example
from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
from lmms_eval.utils import make_table
# Initialize tracker with local output
tracker = EvaluationTracker(output_path="./eval_results")
# Log experiment parameters
tracker.general_config_tracker.log_experiment_args(
    model_source="hf",
    model_args="pretrained=Qwen/Qwen2.5-VL-3B-Instruct",
    system_instruction=None,
    chat_template=None,
    fewshot_as_multiturn=False,
)
# After evaluation completes, save aggregated results
tracker.save_results_aggregated(
    results=results_dict,
    samples=samples_dict,
    datetime_str="20260214_120000",
)
# Save per-task sample logs
for task_name, task_samples in samples_dict.items():
    tracker.save_results_samples(task_name, task_samples)
# Generate a human-readable table
table = make_table(results_dict)
print(table)
Push to HuggingFace Hub
from lmms_eval.loggers.evaluation_tracker import EvaluationTracker
tracker = EvaluationTracker(
    output_path="./eval_results",
    hub_results_org="my-org",
    results_repo_name="lmms-eval-results",
    details_repo_name="lmms-eval-details",
    push_results_to_hub=True,
    push_samples_to_hub=True,
    public_repo=True,
    token="hf_xxxxxxxxxxxxx",
)
# Results will be uploaded after save_results_aggregated
# and save_results_samples are called
make_table Output Format
from lmms_eval.utils import make_table
# Example result_dict structure
result_dict = {
    "results": {
        "mmmu_val": {
            "acc,none": 0.4567,
            "acc_stderr,none": 0.0234,
        },
        "mme": {
            "score,none": 1823.5,
            "score_stderr,none": 45.2,
        },
    },
    "versions": {"mmmu_val": "Yaml", "mme": "Yaml"},
    "n-shot": {"mmmu_val": 0, "mme": 0},
    "higher_is_better": {
        "mmmu_val": {"acc": True},
        "mme": {"score": True},
    },
}
table = make_table(result_dict)
print(table)
# | Tasks | Version | Filter | n-shot | Metric | | Value | | Stderr |
# |----------|---------|--------|--------|--------|-|--------|-|--------|
# | mme | Yaml | none | 0 | score |↑| 1823.5 |±| 45.2 |
# | mmmu_val | Yaml | none | 0 | acc |↑| 0.4567 |±| 0.0234 |
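When tasks are organized into groups, the same function can render group-level rows instead. A hedged sketch, assuming result_dict also carries a "groups" key with the same metric layout as "results":
# Render group-level aggregates, with rows sorted alphabetically
group_table = make_table(result_dict, column="groups", sort_results=True)
print(group_table)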