Implementation: EvolvingLMMs-Lab lmms-eval Job Results
| Knowledge Sources | |
|---|---|
| Domains | Server, Logging |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool for retrieving the results of completed evaluation jobs, with the output directory structure parsed by the lmms-eval framework.
Description
When a job reaches a terminal state (COMPLETED or FAILED), the GET /jobs/{job_id} endpoint returns a JobInfo object whose result field contains parsed output file paths (for completed jobs) or whose error field contains the failure message (for failed jobs).
The result parsing is performed by JobScheduler._parse_output_directory(), a static method that scans the evaluation output directory. It iterates over model subdirectories, groups files by timestamp prefix, identifies *_results.json aggregate metric files and *_samples_*.jsonl per-sample prediction files, and selects the latest timestamp when multiple are present. The resulting dictionary maps model names to their output file paths.
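The grouping-and-latest-selection logic described above can be sketched as follows. This is an illustrative approximation under the stated file-naming convention, not the framework's actual implementation (which lives in JobScheduler._parse_output_directory); the function name and return shape here mirror the documented behavior.

```python
import os
import re
from collections import defaultdict

def parse_output_directory(output_path: str) -> dict:
    """Sketch: scan output_path/<model>/ for timestamp-prefixed output
    files and keep only the latest timestamp group per model."""
    parsed = {}
    for model_name in sorted(os.listdir(output_path)):
        model_dir = os.path.join(output_path, model_name)
        if not os.path.isdir(model_dir):
            continue
        # Group files by their YYYYMMDD_HHMMSS timestamp prefix
        by_ts = defaultdict(lambda: {"results": None, "samples": []})
        for fname in os.listdir(model_dir):
            m = re.match(r"(\d{8}_\d{6})_(results\.json|samples_.+\.jsonl)$", fname)
            if not m:
                continue
            ts, kind = m.groups()
            path = os.path.join(model_dir, fname)
            if kind == "results.json":
                by_ts[ts]["results"] = path
            else:
                by_ts[ts]["samples"].append(path)
        if by_ts:
            # Zero-padded timestamps sort lexicographically == chronologically
            latest = max(by_ts)
            parsed[model_name] = by_ts[latest]
    return parsed
```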
The JobInfo model carries timestamps for creation, start, and completion, enabling clients to measure queue wait time and execution duration.
Usage
Use this implementation when you need to:
- Retrieve the output file paths from a completed evaluation
- Check whether a job has completed or failed
- Inspect error messages from failed evaluations
- Determine execution timing from job timestamps
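A common pattern combining these uses is to poll the endpoint until the job reaches a terminal state. The sketch below injects the HTTP fetch as a callable so the loop itself is transport-agnostic; the helper name and the terminal-state set (taken from the JobStatus enum documented below) are illustrative, not part of the lmms-eval API.

```python
import time

# Terminal states per the documented JobStatus enum
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def wait_for_job(fetch_job, poll_interval: float = 5.0, timeout: float = 3600.0) -> dict:
    """Poll the injected fetch_job callable (e.g. a GET /jobs/{job_id}
    request) until the returned JobInfo dict is in a terminal state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        job = fetch_job()
        if job["status"] in TERMINAL_STATES:
            return job
        time.sleep(poll_interval)
    raise TimeoutError("job did not reach a terminal state in time")
```

With httpx, the callable might be `lambda: httpx.get(f"http://localhost:8000/jobs/{job_id}").json()` (base URL assumed, as in the examples below).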
Code Reference
Source Location
- Repository: lmms-eval
- File:
lmms_eval/entrypoints/protocol.py - Lines: L14-21 (JobStatus), L41-52 (JobInfo)
- File:
lmms_eval/entrypoints/job_scheduler.py - Lines: L241-248 (_complete_job), L376-418 (_parse_output_directory)
Signature
# JobInfo model (returned by GET /jobs/{job_id})
class JobInfo(BaseModel):
    job_id: str
    status: JobStatus
    created_at: str
    started_at: Optional[str] = None
    completed_at: Optional[str] = None
    request: EvaluateRequest
    result: Optional[Dict[str, Any]] = None
    error: Optional[str] = None
    position_in_queue: Optional[int] = None

# Output directory parser
@staticmethod
def _parse_output_directory(output_path: str) -> Dict[str, Dict[str, Any]]:
    """
    Parse output directory: output_path/model_name/YYYYMMDD_HHMMSS_results.json
    Returns:
        {model_name: {"results": path, "samples": [paths]}}
    """

# JobStatus enum
class JobStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
Import
from lmms_eval.entrypoints.protocol import JobInfo, JobStatus
from lmms_eval.entrypoints.job_scheduler import JobScheduler
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| job_id | str | Yes | UUID4 identifier of the job to query (path parameter) |
Outputs
JobInfo (completed job):
| Name | Type | Description |
|---|---|---|
| job_id | str | The job's UUID4 identifier |
| status | "completed" | Terminal status indicating successful completion |
| created_at | str | ISO 8601 timestamp of job creation |
| started_at | str | ISO 8601 timestamp when execution began |
| completed_at | str | ISO 8601 timestamp when evaluation finished |
| request | EvaluateRequest | Original evaluation parameters |
| result | Dict[str, Dict] | Parsed output: {model_name: {"results": "/path/to/results.json", "samples": ["/path/to/samples.jsonl"]}} |
| error | None | Null for completed jobs |
| position_in_queue | None | Null for terminal jobs |
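Since the result field holds only file paths, a client typically follows up by reading the aggregate metrics files. The helper below is an illustrative sketch (not part of lmms-eval) of consuming the result mapping shown above.

```python
import json

def load_aggregate_metrics(result: dict) -> dict:
    """Read each model's *_results.json aggregate-metrics file from a
    completed job's parsed result mapping (illustrative helper)."""
    metrics = {}
    for model_name, paths in result.items():
        with open(paths["results"]) as f:
            metrics[model_name] = json.load(f)
    return metrics
```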
JobInfo (failed job):
| Name | Type | Description |
|---|---|---|
| status | "failed" | Terminal status indicating failure |
| result | None | Null for failed jobs |
| error | str | Error message describing the failure (e.g., "Evaluation failed with return code 1") |
Usage Examples
Basic Example
import httpx

job_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
response = httpx.get(f"http://localhost:8000/jobs/{job_id}")
job = response.json()

if job["status"] == "completed":
    for model_name, output in job["result"].items():
        print(f"Model: {model_name}")
        print(f"  Results: {output['results']}")
        for sample_file in output["samples"]:
            print(f"  Samples: {sample_file}")
elif job["status"] == "failed":
    print(f"Job failed: {job['error']}")
Timing Analysis Example
import httpx
from datetime import datetime

job_id = "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
response = httpx.get(f"http://localhost:8000/jobs/{job_id}")
job = response.json()

if job["status"] == "completed":
    created = datetime.fromisoformat(job["created_at"])
    started = datetime.fromisoformat(job["started_at"])
    completed = datetime.fromisoformat(job["completed_at"])
    queue_wait = (started - created).total_seconds()
    execution_time = (completed - started).total_seconds()
    print(f"Queue wait: {queue_wait:.1f}s, Execution: {execution_time:.1f}s")
Output Directory Structure
output_path/
    model_name/
        20260214_103045_results.json        # Aggregate metrics
        20260214_103045_samples_mmmu.jsonl  # Per-sample predictions for mmmu
        20260214_103045_samples_mme.jsonl   # Per-sample predictions for mme
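Because the zero-padded YYYYMMDD_HHMMSS prefixes sort lexicographically in chronological order, the latest run in a directory like the one above can be selected with a plain string max(), which is the "selects the latest timestamp" step described earlier:

```python
# Zero-padded YYYYMMDD_HHMMSS prefixes sort lexicographically in
# chronological order, so the newest run is simply the max() string.
timestamps = ["20260213_221500", "20260214_091200", "20260214_103045"]
latest = max(timestamps)  # → "20260214_103045"
```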