
Principle: EvolvingLMMs-Lab lmms-eval Results Retrieval

From Leeroopedia
Knowledge Sources
Domains: Server, Logging
Last Updated: 2026-02-14 00:00 GMT

Overview

Retrieval of completed evaluation job results through a parsed output directory structure containing aggregate metrics and sample-level predictions.

Description

Results Retrieval is the process of accessing the outputs produced by a completed evaluation job. When an evaluation finishes successfully, the lmms-eval server parses the output directory to identify result files and makes them available through the job status endpoint. This allows clients to discover what files were generated without needing direct filesystem access to the server.

The results lifecycle has three phases:

  1. Output Generation: During evaluation, lmms-eval writes results to an output directory (either user-specified via output_dir or an auto-generated temporary directory). The directory follows a standard structure: output_path/model_name/YYYYMMDD_HHMMSS_results.json for aggregate metrics and output_path/model_name/YYYYMMDD_HHMMSS_samples_taskname.jsonl for per-sample predictions.
  2. Output Parsing: After the evaluation subprocess exits successfully, the scheduler calls _parse_output_directory() to scan the output directory. For each model subdirectory, files are grouped by timestamp prefix. The parser identifies *_results.json files (containing aggregate benchmark scores) and *_samples_*.jsonl files (containing individual prediction records). When multiple timestamps exist for the same model (from previous runs in the same directory), only the latest timestamp is used.
  3. Result Delivery: The parsed result dictionary is stored on the JobInfo record's result field. Clients querying GET /jobs/{job_id} for a completed job receive this dictionary, keyed by model name, with paths to the results JSON and an array of sample JSONL file paths.
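The parsing phase above can be sketched as follows. Note that parse_output_directory is a hypothetical stand-in for the server's _parse_output_directory(), and the exact shape of the returned dictionary is an assumption based on the description, not the actual lmms-eval server code.

```python
from collections import defaultdict
from pathlib import Path

def parse_output_directory(output_path: str) -> dict:
    """Illustrative sketch: group each model's result files by their
    YYYYMMDD_HHMMSS timestamp prefix and keep only the latest run."""
    result = {}
    for model_dir in Path(output_path).iterdir():
        if not model_dir.is_dir():
            continue
        by_timestamp = defaultdict(lambda: {"results": None, "samples": []})
        for f in model_dir.iterdir():
            ts = f.name[:15]  # "YYYYMMDD_HHMMSS" prefix
            if f.name.endswith("_results.json"):
                by_timestamp[ts]["results"] = str(f)
            elif "_samples_" in f.name and f.suffix == ".jsonl":
                by_timestamp[ts]["samples"].append(str(f))
        if by_timestamp:
            latest = max(by_timestamp)  # lexicographic order == chronological here
            result[model_dir.name] = by_timestamp[latest]
    return result
```

Because the timestamp format is zero-padded and fixed-width, plain string comparison sorts chronologically, which is what makes the `max()` deduplication safe.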

For failed jobs, the error field on JobInfo contains the error message from the evaluation subprocess, while the result field remains None.
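The JobInfo fields involved can be pictured as a small dataclass. The field names below follow the text, but the actual server model (types, defaults, status enum) may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JobInfo:
    # Illustrative sketch of the record described above, not the
    # actual lmms-eval server definition.
    job_id: str
    status: str = "PENDING"        # e.g. PENDING -> RUNNING -> COMPLETED/FAILED
    result: Optional[dict] = None  # set once when the job COMPLETED
    error: Optional[str] = None    # set once when the job FAILED
    completed_at: Optional[datetime] = None
```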

Usage

Use the Results Retrieval principle when you need to:

  • Obtain the file paths to evaluation results after a job completes
  • Determine which models and timestamps have results in the output directory
  • Build downstream processing pipelines that consume evaluation metrics and sample predictions
  • Diagnose failed evaluations by inspecting the stored error messages
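As a minimal client-side sketch of these uses: the GET /jobs/{job_id} endpoint and the status/result/error fields come from the text, but the base URL and the exact JSON field layout are assumptions to adjust for your deployment.

```python
import json
import time
import urllib.request

def wait_for_results(base_url: str, job_id: str, poll_seconds: float = 5.0) -> dict:
    """Poll GET /jobs/{job_id} until the job reaches a terminal state."""
    while True:
        with urllib.request.urlopen(f"{base_url}/jobs/{job_id}") as resp:
            job = json.load(resp)
        if job["status"] == "COMPLETED":
            # Dictionary keyed by model name, with paths to the results
            # JSON and the sample JSONL files.
            return job["result"]
        if job["status"] == "FAILED":
            raise RuntimeError(f"evaluation failed: {job.get('error')}")
        time.sleep(poll_seconds)
```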

Theoretical Basis

The Results Retrieval design follows a structured output discovery pattern:

Convention over Configuration: The output directory structure follows a fixed convention (model_name/timestamp_results.json) established by the lmms-eval framework. The parser relies on this convention rather than requiring explicit output registration, making it robust to changes in how many tasks or models produce output.

Timestamp-Based Deduplication: When an output directory is reused across multiple evaluation runs, the parser groups files by their timestamp prefix and selects only the latest. This prevents stale results from previous runs from contaminating the current job's output. A warning is logged when multiple timestamps are detected.

Path-Based Result References: Rather than embedding full result content in the API response, the server returns file paths. This keeps API responses lightweight and allows clients to selectively download only the files they need. For remote deployments, these paths would need to be served through a separate file-serving mechanism.

Terminal State Immutability: Once a job reaches COMPLETED or FAILED status, its result and error fields are set exactly once and never modified. The completed_at timestamp records when this transition occurred. This immutability simplifies reasoning about concurrent access and caching.
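The set-once semantics can be sketched as a guarded transition helper. This is an illustration of the invariant, not the server's actual implementation; the attribute names mirror the JobInfo fields described above.

```python
from datetime import datetime, timezone

class TerminalStateError(RuntimeError):
    pass

def mark_terminal(job, status: str, *, result=None, error=None):
    """Illustrative set-once transition: a job that is already
    COMPLETED or FAILED can never be modified again."""
    if job.status in ("COMPLETED", "FAILED"):
        raise TerminalStateError(f"job {job.job_id} already terminal")
    job.status = status
    job.result = result
    job.error = error
    job.completed_at = datetime.now(timezone.utc)
```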

Subprocess Exit Code Validation: The scheduler checks the evaluation subprocess return code before parsing results. A non-zero exit code causes the job to be marked as FAILED with an error message, preventing the server from returning partial or corrupted results from an aborted evaluation.
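The validation step can be sketched as follows; the function name and the returned dictionary shape are hypothetical, but the check mirrors the described behavior: a non-zero exit code marks the job FAILED and skips parsing entirely.

```python
import subprocess

def run_and_validate(cmd: list[str], output_dir: str) -> dict:
    """Illustrative sketch: parse results only when the evaluation
    subprocess exits cleanly; otherwise mark the job FAILED."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # Do not parse the output directory: it may hold partial files.
        return {"status": "FAILED",
                "error": proc.stderr.strip() or f"exit code {proc.returncode}"}
    # On success, the scheduler would call _parse_output_directory(output_dir)
    # and attach the parsed dictionary to the JobInfo record.
    return {"status": "COMPLETED", "output_dir": output_dir}
```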

Related Pages

Implemented By
