Principle:EvolvingLMMs Lab Lmms eval Queue Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Server, Job_Management |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Managing evaluation job lifecycle with queue tracking, status monitoring, and cancellation capabilities.
Description
Queue Monitoring encompasses the set of operations that allow clients to observe and control the state of evaluation jobs after submission. The lmms-eval server maintains an in-memory registry of all jobs and provides HTTP endpoints for querying individual job status, retrieving aggregate queue statistics, and cancelling queued jobs.
Each job progresses through a well-defined state machine with five possible states:
- QUEUED: The job has been accepted and is waiting in the processing queue.
- RUNNING: The job is currently being executed by the background worker.
- COMPLETED: The evaluation finished successfully and results are available.
- FAILED: The evaluation encountered an error; the error message is stored on the job.
- CANCELLED: The job was cancelled by a client before execution began.
The monitoring system provides three complementary views:
- Individual Job Status (GET /jobs/{job_id}): Returns the full JobInfo, including timestamps for creation, start, and completion, the original evaluation request, the result dictionary (if completed), the error string (if failed), and the current queue position (if still queued). The queue position is recalculated at query time by counting how many queued jobs were created earlier.
- Queue Overview (GET /queue): Returns aggregate statistics including the number of queued jobs, the ID of the currently running job (if any), a list of all queued job IDs, and counts of completed and failed jobs.
- Job Cancellation (DELETE /jobs/{job_id}): Allows cancelling a job that is still in the QUEUED state. Running jobs cannot be cancelled. Attempting to cancel a completed, failed, or already-cancelled job returns an error.
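As a rough illustration of how a client might address the three endpoints above, the following sketch builds the requests with Python's standard library only. The base URL, the helper names, and the assumption that every response body is JSON are illustrative and not taken from lmms-eval; only the HTTP methods and paths come from the text.

```python
import json
import urllib.request

# Assumption: the server listens here; adjust to your deployment.
BASE_URL = "http://localhost:8000"


def build_request(method: str, path: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a request for one of the monitoring endpoints."""
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


def job_status_request(job_id: str) -> urllib.request.Request:
    # GET /jobs/{job_id}: individual job status
    return build_request("GET", f"/jobs/{job_id}")


def queue_overview_request() -> urllib.request.Request:
    # GET /queue: aggregate queue statistics
    return build_request("GET", "/queue")


def cancel_job_request(job_id: str) -> urllib.request.Request:
    # DELETE /jobs/{job_id}: cancel a job that is still queued
    return build_request("DELETE", f"/jobs/{job_id}")


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send a request and decode the body as JSON (assumed response format).

    urllib raises urllib.error.HTTPError for 4xx/5xx responses, for example
    when cancelling a job that is no longer queued.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as send(job_status_request("some-job-id")) would then return the decoded job record, whose exact field names depend on the server's JobInfo schema.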
Usage
Use the Queue Monitoring principle when you need to:
- Poll for the completion status of a submitted evaluation job
- Build dashboards or monitoring UIs that display queue depth and job progress
- Implement timeout logic that cancels jobs that have waited in the queue longer than expected
- Audit the history of completed and failed evaluations
- Determine the current queue position to estimate wait time
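The polling and queue-timeout use cases above can be sketched as a small client-side loop. This is a minimal sketch, not part of lmms-eval: wait_for_job is a hypothetical helper, the status strings are assumed to match the state names used on this page, and get_status and cancel are caller-supplied callables (for example thin wrappers around GET /jobs/{job_id} and DELETE /jobs/{job_id}).

```python
import time
from typing import Callable

# Assumption: status strings match the state names used on this page.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def wait_for_job(
    get_status: Callable[[], str],
    cancel: Callable[[], None],
    max_queue_seconds: float,
    poll_interval: float = 2.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal state and return that state.

    If the job is still QUEUED after max_queue_seconds, request cancellation
    once per poll and keep polling. Only the time spent in the queue is
    limited, because running jobs cannot be cancelled.
    """
    queued_since = clock()
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if status == "QUEUED" and clock() - queued_since > max_queue_seconds:
            # The job may start running between the status check and this
            # call; in that case the loop simply continues until it finishes.
            cancel()
        sleep(poll_interval)
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting; production callers can leave them at their defaults.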
Theoretical Basis
The Queue Monitoring design is built around a job state machine with well-defined transitions, supported by mechanisms that keep the job registry consistent and bounded in size:
Finite State Machine: The five-state JobStatus enum defines clear terminal states (COMPLETED, FAILED, CANCELLED) and non-terminal states (QUEUED, RUNNING). Transitions are enforced by the scheduler: only QUEUED jobs can move to RUNNING or CANCELLED, and only RUNNING jobs can move to COMPLETED or FAILED. This prevents invalid state transitions and ensures data consistency.
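The transition rules described above can be written down as a lookup table. The sketch below is illustrative only: the enum string values and the transition helper are assumptions, while the five state names and the permitted transitions follow the description.

```python
from enum import Enum


class JobStatus(Enum):
    # The string values are assumptions; only the state names come from the text.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Each state maps to the set of states it may move to.
# Terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.RUNNING, JobStatus.CANCELLED},
    JobStatus.RUNNING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
    JobStatus.CANCELLED: set(),
}


def is_terminal(status: JobStatus) -> bool:
    """A state is terminal when no transition leaves it."""
    return not ALLOWED_TRANSITIONS[status]


def transition(current: JobStatus, new: JobStatus) -> JobStatus:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current.name} -> {new.name}")
    return new
```

Under these rules, transition(JobStatus.RUNNING, JobStatus.CANCELLED) raises ValueError, which matches the constraint that running jobs cannot be cancelled.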
Async Lock Protection: All job state reads and writes are performed under an asyncio.Lock. This ensures that concurrent HTTP requests accessing the job registry do not observe partially-updated state. The lock is held briefly for dictionary lookups and status updates, minimizing contention.
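A minimal sketch of the locking pattern, assuming a simplified registry that stores only a status string per job. The JobRegistry class and its method names are hypothetical and do not reflect the actual lmms-eval scheduler; the point is that the check and the update happen under one lock acquisition.

```python
import asyncio
from typing import Dict, Optional


class JobRegistry:
    """Hypothetical in-memory registry; every access goes through one lock."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}  # job_id -> status name
        self._lock = asyncio.Lock()

    async def add(self, job_id: str) -> None:
        async with self._lock:
            self._jobs[job_id] = "QUEUED"

    async def get_status(self, job_id: str) -> Optional[str]:
        async with self._lock:
            return self._jobs.get(job_id)

    async def cancel(self, job_id: str) -> bool:
        """Cancel a job only if it is still queued.

        The status check and the update share one critical section, so two
        concurrent cancel requests cannot both succeed.
        """
        async with self._lock:
            if self._jobs.get(job_id) != "QUEUED":
                return False
            self._jobs[job_id] = "CANCELLED"
            return True
```

Because the lock is held only for a dictionary lookup and an assignment, requests wait on each other only briefly, which is the low-contention property described above.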
Dynamic Position Calculation: Rather than maintaining a separate position counter that would need to be updated on every cancellation or completion, queue position is computed on demand by scanning the job registry. This trades a small computation cost at query time for simpler state management and fewer opportunities for position drift.
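The on-demand calculation can be sketched as a single scan over the registry. The Job record and queue_position function below are illustrative assumptions; in particular, this sketch returns a zero-based count of earlier queued jobs, and the server may present the position differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    # Hypothetical minimal record; the real JobInfo carries more fields.
    job_id: str
    status: str
    created_at: float


def queue_position(jobs: Dict[str, Job], job_id: str) -> Optional[int]:
    """Count queued jobs created before the given job.

    Returns None when the job is unknown or no longer queued, since a
    position is only meaningful for jobs still waiting.
    """
    job = jobs.get(job_id)
    if job is None or job.status != "QUEUED":
        return None
    return sum(
        1
        for other in jobs.values()
        if other.status == "QUEUED" and other.created_at < job.created_at
    )
```

Because cancelled, running, and finished jobs are skipped by the status filter, a cancellation ahead of a job lowers that job's position on the next query without any bookkeeping.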
Automatic Cleanup: The scheduler automatically calls cleanup_old_jobs() after each job completes. This removes the oldest terminal jobs when the count exceeds max_completed_jobs, preventing unbounded memory growth in long-running server instances. Jobs are removed in order of their completed_at timestamp.
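The eviction rule can be sketched as follows. The function reuses the cleanup_old_jobs name from the description, but its signature, the FinishedJob record, and the assumption that every terminal job carries a completed_at timestamp are illustrative rather than taken from the lmms-eval source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


@dataclass
class FinishedJob:
    # Hypothetical minimal record for illustration.
    job_id: str
    status: str
    completed_at: Optional[float] = None  # assumed set for all terminal jobs


def cleanup_old_jobs(jobs: Dict[str, FinishedJob], max_completed_jobs: int) -> List[str]:
    """Evict the oldest terminal jobs so that at most max_completed_jobs remain.

    Queued and running jobs are never removed. Returns the removed job IDs,
    oldest first.
    """
    terminal = [job for job in jobs.values() if job.status in TERMINAL_STATES]
    excess = len(terminal) - max_completed_jobs
    if excess <= 0:
        return []
    terminal.sort(key=lambda job: job.completed_at)
    removed = [job.job_id for job in terminal[:excess]]
    for job_id in removed:
        del jobs[job_id]
    return removed
```

Running this after each job finishes keeps the registry size bounded by the number of active jobs plus max_completed_jobs.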
Cancellation Constraints: Running jobs cannot be cancelled because the evaluation executes in a subprocess. Terminating an in-flight subprocess could leave GPU resources in an inconsistent state. This design choice favors safety over flexibility, requiring the running job to complete or fail naturally.