Principle: EvolvingLMMs-Lab lmms-eval Results Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Server, Logging |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Retrieving the results of a completed evaluation job: the server parses the output directory and exposes the paths to aggregate metrics and sample-level predictions.
Description
Results Retrieval is the process of accessing the outputs produced by a completed evaluation job. When an evaluation finishes successfully, the lmms-eval server parses the output directory to identify result files and makes them available through the job status endpoint. This allows clients to discover what files were generated without needing direct filesystem access to the server.
The results lifecycle has three phases:
- Output Generation: During evaluation, lmms-eval writes results to an output directory (either user-specified via `output_dir` or an auto-generated temporary directory). The directory follows a standard structure: `output_path/model_name/YYYYMMDD_HHMMSS_results.json` for aggregate metrics and `output_path/model_name/YYYYMMDD_HHMMSS_samples_taskname.jsonl` for per-sample predictions.
- Output Parsing: After the evaluation subprocess exits successfully, the scheduler calls `_parse_output_directory()` to scan the output directory. For each model subdirectory, files are grouped by timestamp prefix. The parser identifies `*_results.json` files (containing aggregate benchmark scores) and `*_samples_*.jsonl` files (containing individual prediction records). When multiple timestamps exist for the same model (from previous runs in the same directory), only the latest timestamp is used.
- Result Delivery: The parsed result dictionary is stored on the `JobInfo` record's `result` field. Clients querying `GET /jobs/{job_id}` for a completed job receive this dictionary, keyed by model name, with paths to the results JSON and an array of sample JSONL file paths.
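The parsing phase can be sketched as follows. This `parse_output_directory` is a simplified, hypothetical stand-in for the server's `_parse_output_directory()`, written only to illustrate the grouping-by-timestamp logic; the actual implementation and the exact shape of the returned dictionary may differ:

```python
import re
from collections import defaultdict
from pathlib import Path

# Filenames follow the lmms-eval convention described above: a
# YYYYMMDD_HHMMSS timestamp prefix, then "_results.json" or
# "_samples_<task>.jsonl".
TS_RE = re.compile(r"^(\d{8}_\d{6})_")

def parse_output_directory(output_path: str) -> dict:
    """Group result files per model, keeping only the latest timestamp."""
    results = {}
    for model_dir in Path(output_path).iterdir():
        if not model_dir.is_dir():
            continue
        by_ts = defaultdict(lambda: {"results": None, "samples": []})
        for f in model_dir.iterdir():
            m = TS_RE.match(f.name)
            if not m:
                continue
            ts = m.group(1)
            if f.name.endswith("_results.json"):
                by_ts[ts]["results"] = str(f)
            elif "_samples_" in f.name and f.suffix == ".jsonl":
                by_ts[ts]["samples"].append(str(f))
        if by_ts:
            # Zero-padded timestamps sort lexicographically in
            # chronological order, so max() picks the latest run.
            results[model_dir.name] = by_ts[max(by_ts)]
    return results
```

The per-model grouping mirrors the delivery format: one entry per model name, holding the results JSON path and the list of sample JSONL paths.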
For failed jobs, the `error` field on `JobInfo` contains the error message from the evaluation subprocess, while the `result` field remains `None`.
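A client consuming the job status endpoint therefore has to branch on the terminal state before touching `result`. A minimal sketch, assuming a job record deserialized from the `GET /jobs/{job_id}` response with `status`, `result`, and `error` keys (the exact field names in the real response may differ):

```python
def extract_result(job: dict) -> dict:
    """Return the result dict from a job record, raising on failure states.

    `job` is assumed to be the parsed JSON body of GET /jobs/{job_id}.
    """
    status = job.get("status")
    if status == "FAILED":
        # Failed jobs carry the subprocess error message instead of results.
        raise RuntimeError(f"evaluation failed: {job.get('error')}")
    if status != "COMPLETED":
        raise ValueError(f"job not finished yet (status={status})")
    return job["result"]
```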
Usage
Use the Results Retrieval principle when you need to:
- Obtain the file paths to evaluation results after a job completes
- Determine which models and timestamps have results in the output directory
- Build downstream processing pipelines that consume evaluation metrics and sample predictions
- Diagnose failed evaluations by inspecting the stored error messages
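For the downstream-pipeline case, once a client has the file paths from the job record, reading the two file types is straightforward. A sketch under the assumption that `*_results.json` holds a single JSON object and `*_samples_*.jsonl` holds one JSON record per line (the record schemas themselves are task-dependent):

```python
import json
from pathlib import Path

def load_metrics(results_path: str) -> dict:
    """Load aggregate benchmark scores from a *_results.json file."""
    return json.loads(Path(results_path).read_text())

def iter_samples(samples_path: str):
    """Yield per-sample prediction records from a *_samples_*.jsonl file."""
    with open(samples_path) as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing blank lines
                yield json.loads(line)
```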
Theoretical Basis
The Results Retrieval design follows a structured output discovery pattern:
Convention over Configuration: The output directory structure follows a fixed convention (`model_name/timestamp_results.json`) established by the lmms-eval framework. The parser relies on this convention rather than requiring explicit output registration, making it robust to changes in how many tasks or models produce output.
Timestamp-Based Deduplication: When an output directory is reused across multiple evaluation runs, the parser groups files by their timestamp prefix and selects only the latest. This prevents stale results from previous runs from contaminating the current job's output. A warning is logged when multiple timestamps are detected.
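This selection works without any date parsing because zero-padded `YYYYMMDD_HHMMSS` strings sort lexicographically in chronological order:

```python
# Timestamp prefixes from three runs sharing one output directory.
timestamps = ["20241231_235959", "20250101_000000", "20250214_093000"]

# Plain string comparison already orders these chronologically,
# so max() selects the most recent run.
latest = max(timestamps)
```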
Path-Based Result References: Rather than embedding full result content in the API response, the server returns file paths. This keeps API responses lightweight and allows clients to selectively download only the files they need. For remote deployments, these paths would need to be served through a separate file-serving mechanism.
Terminal State Immutability: Once a job reaches COMPLETED or FAILED status, its result and error fields are set exactly once and never modified. The completed_at timestamp records when this transition occurred. This immutability simplifies reasoning about concurrent access and caching.
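The set-once semantics can be made explicit in code. `JobRecord` below is a hypothetical simplification of the server's `JobInfo`, shown only to illustrate the single-transition invariant:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class JobRecord:
    """Simplified job record; terminal fields are written exactly once."""
    job_id: str
    status: str = "RUNNING"
    result: Optional[dict] = None
    error: Optional[str] = None
    completed_at: Optional[datetime] = None

    def finish(self, result: Optional[dict] = None,
               error: Optional[str] = None) -> None:
        """Transition to COMPLETED or FAILED; reject a second transition."""
        if self.completed_at is not None:
            raise RuntimeError("job is already in a terminal state")
        self.status = "FAILED" if error else "COMPLETED"
        self.result, self.error = result, error
        self.completed_at = datetime.now(timezone.utc)
```

Because `completed_at` doubles as the "already finished" flag, readers can cache a terminal record indefinitely without revalidating it.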
Subprocess Exit Code Validation: The scheduler checks the evaluation subprocess return code before parsing results. A non-zero exit code causes the job to be marked as FAILED with an error message, preventing the server from returning partial or corrupted results from an aborted evaluation.
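The guard can be sketched as below; the command, return shape, and error-message format are illustrative assumptions, not the scheduler's actual code:

```python
import subprocess

def run_and_collect(cmd: list[str], output_dir: str) -> dict:
    """Run an evaluation command; report outputs only on a zero exit code."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Never parse a directory left behind by an aborted run: mark the
        # job FAILED and surface the subprocess's stderr (or exit code).
        msg = proc.stderr.strip() or f"exit code {proc.returncode}"
        return {"status": "FAILED", "error": msg}
    # Only now is it safe to scan output_dir for result files.
    return {"status": "COMPLETED", "output_dir": output_dir}
```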