
Implementation:EvolvingLMMs Lab Lmms eval Job Results

From Leeroopedia
Knowledge Sources
Domains Server, Logging
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool for retrieving completed evaluation job results from the lmms-eval framework, with the output directory parsed into a structured mapping of file paths.

Description

When a job reaches a terminal state (COMPLETED or FAILED), the GET /jobs/{job_id} endpoint returns a JobInfo object whose result field contains parsed output file paths (for completed jobs) or whose error field contains the failure message (for failed jobs).

The result parsing is performed by JobScheduler._parse_output_directory(), a static method that scans the evaluation output directory. It iterates over model subdirectories, groups files by timestamp prefix, identifies *_results.json aggregate metric files and *_samples_*.jsonl per-sample prediction files, and selects the latest timestamp when multiple are present. The resulting dictionary maps model names to their output file paths.
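The grouping-and-latest-selection strategy described above might be sketched roughly as follows. This is an illustrative reimplementation, not the library's actual `_parse_output_directory` code; the function name, regex, and filename checks here are assumptions based on the description:

```python
import re
from pathlib import Path
from typing import Any, Dict

# Timestamp prefix of the form YYYYMMDD_HHMMSS_ (assumed from the docstring).
TIMESTAMP_RE = re.compile(r"^(\d{8}_\d{6})_")

def parse_output_directory(output_path: str) -> Dict[str, Dict[str, Any]]:
    """Map each model subdirectory to its latest results/samples file paths."""
    parsed: Dict[str, Dict[str, Any]] = {}
    for model_dir in Path(output_path).iterdir():
        if not model_dir.is_dir():
            continue
        # Group files by their timestamp prefix.
        groups: Dict[str, Dict[str, Any]] = {}
        for f in model_dir.iterdir():
            m = TIMESTAMP_RE.match(f.name)
            if not m:
                continue
            group = groups.setdefault(m.group(1), {"results": None, "samples": []})
            if f.name.endswith("_results.json"):
                group["results"] = str(f)
            elif "_samples_" in f.name and f.name.endswith(".jsonl"):
                group["samples"].append(str(f))
        if groups:
            # Timestamp prefixes sort lexicographically in chronological
            # order, so max() selects the latest evaluation run.
            parsed[model_dir.name] = groups[max(groups)]
    return parsed
```

The key design point is that fixed-width `YYYYMMDD_HHMMSS` prefixes make lexicographic and chronological order coincide, so no date parsing is needed to pick the latest run.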

The JobInfo model carries timestamps for creation, start, and completion, enabling clients to measure queue wait time and execution duration.

Usage

Use this implementation when you need to:

  • Retrieve the output file paths from a completed evaluation
  • Check whether a job has completed or failed
  • Inspect error messages from failed evaluations
  • Determine execution timing from job timestamps

Code Reference

Source Location

  • Repository: lmms-eval
  • File: lmms_eval/entrypoints/protocol.py
  • Lines: L14-21 (JobStatus), L41-52 (JobInfo)
  • File: lmms_eval/entrypoints/job_scheduler.py
  • Lines: L241-248 (_complete_job), L376-418 (_parse_output_directory)

Signature

# JobInfo model (returned by GET /jobs/{job_id})
class JobInfo(BaseModel):
    job_id: str
    status: JobStatus
    created_at: str
    started_at: Optional[str] = None
    completed_at: Optional[str] = None
    request: EvaluateRequest
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    position_in_queue: Optional[int] = None

# Output directory parser
@staticmethod
def _parse_output_directory(output_path: str) -> Dict[str, Dict[str, Any]]:
    """
    Parse output directory: output_path/model_name/YYYYMMDD_HHMMSS_results.json

    Returns:
        {model_name: {"results": path, "samples": [paths]}}
    """

# JobStatus enum
class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"

Import

from lmms_eval.entrypoints.protocol import JobInfo, JobStatus
from lmms_eval.entrypoints.job_scheduler import JobScheduler

I/O Contract

Inputs

Name Type Required Description
job_id str Yes UUID4 identifier of the job to query (path parameter)

Outputs

JobInfo (completed job):

Name Type Description
job_id str The job's UUID4 identifier
status "completed" Terminal status indicating successful completion
created_at str ISO 8601 timestamp of job creation
started_at str ISO 8601 timestamp when execution began
completed_at str ISO 8601 timestamp when evaluation finished
request EvaluateRequest Original evaluation parameters
result Dict[str, Dict] Parsed output: {model_name: {"results": "/path/to/results.json", "samples": ["/path/to/samples.jsonl"]}}
error None Null for completed jobs
position_in_queue None Null for terminal jobs

JobInfo (failed job):

Name Type Description
status "failed" Terminal status indicating failure
result None Null for failed jobs
error str Error message describing the failure (e.g., "Evaluation failed with return code 1")

Usage Examples

Basic Example

import httpx

job_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
response = httpx.get(f"http://localhost:8000/jobs/{job_id}")
job = response.json()

if job["status"] == "completed":
    for model_name, output in job["result"].items():
        print(f"Model: {model_name}")
        print(f"  Results: {output['results']}")
        for sample_file in output["samples"]:
            print(f"  Samples: {sample_file}")
elif job["status"] == "failed":
    print(f"Job failed: {job['error']}")

Timing Analysis Example

import httpx
from datetime import datetime

job_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
response = httpx.get(f"http://localhost:8000/jobs/{job_id}")
job = response.json()

if job["status"] == "completed":
    created = datetime.fromisoformat(job["created_at"])
    started = datetime.fromisoformat(job["started_at"])
    completed = datetime.fromisoformat(job["completed_at"])

    queue_wait = (started - created).total_seconds()
    execution_time = (completed - started).total_seconds()
    print(f"Queue wait: {queue_wait:.1f}s, Execution: {execution_time:.1f}s")

Output Directory Structure

output_path/
  model_name/
    20260214_103045_results.json       # Aggregate metrics
    20260214_103045_samples_mmmu.jsonl  # Per-sample predictions for mmmu
    20260214_103045_samples_mme.jsonl   # Per-sample predictions for mme
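Given these paths, the files can be read back in the usual way: the results file is a single JSON object and each samples file is JSONL, one record per line. A sketch of two small readers follows (the internal schema of both files is framework-defined and not documented here, so these helpers only assume valid JSON/JSONL):

```python
import json
from typing import Any, Dict, Iterator

def load_results(results_path: str) -> Dict[str, Any]:
    """Load an aggregate *_results.json file into a dict.

    The metric schema inside is defined by lmms-eval; we only assume a
    top-level JSON object.
    """
    with open(results_path) as f:
        return json.load(f)

def iter_samples(samples_path: str) -> Iterator[Dict[str, Any]]:
    """Yield one per-sample prediction record per line of *_samples_*.jsonl."""
    with open(samples_path) as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing blank lines
                yield json.loads(line)
```

Streaming the samples file line by line avoids loading large prediction sets into memory at once, which matters for benchmarks with many per-sample records.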

Related Pages

Implements Principle
