Workflow:EvolvingLMMs Lab Lmms eval Server Mode Evaluation

From Leeroopedia
Revision as of 11:05, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/EvolvingLMMs_Lab_Lmms_eval_Server_Mode_Evaluation.md)
Knowledge Sources
Domains: LLMs, Multimodal_Evaluation, Infrastructure
Last Updated: 2026-02-14 00:00 GMT

Overview

End-to-end process for running lmms-eval evaluations in HTTP server mode, which enables remote job submission, queue management, and asynchronous result retrieval.

Description

This workflow covers the server-based evaluation mode of lmms-eval. Instead of running evaluations directly from the CLI, the server mode provides a FastAPI-based HTTP server that accepts evaluation jobs via REST API calls. Jobs are managed through a JobScheduler that processes requests sequentially, allowing multiple evaluations to be queued and executed without manual intervention. Clients can submit jobs, monitor their status, and retrieve results programmatically using the EvalClient Python class or direct HTTP calls.

Usage

Execute this workflow when you need to run evaluations as a service, integrate evaluation into a CI/CD pipeline, manage multiple evaluation jobs through a queue, or provide evaluation capabilities to remote users. This is particularly useful for teams that need centralized evaluation infrastructure or when evaluations are triggered programmatically from other systems.

Execution Steps

Step 1: Server Launch

Start the lmms-eval HTTP server using the launch_server entry point. The server binds to a configurable host and port, initializes the JobScheduler for queue management, and exposes REST API endpoints. The server uses FastAPI with uvicorn as the ASGI server. Configuration is provided through the ServerArgs dataclass which controls host, port, maximum completed jobs to retain, and temporary directory settings.

Key considerations:

  • The server is intended for trusted environments only; it has no built-in authentication
  • Default port is 8000; customize via ServerArgs
  • API documentation is auto-generated at /docs (Swagger UI)
  • Evaluations run inside the server process, so the host needs enough GPU resources for the models being evaluated
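
As a rough illustration of the configuration described above, the sketch below models the settings as a small dataclass. ServerArgsSketch is a hypothetical stand-in, not the real ServerArgs: the field names follow the description in this step, and every default except port 8000 is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServerArgsSketch:
    """Hypothetical stand-in for ServerArgs; names and defaults are assumptions."""

    host: str = "localhost"         # assumed default, matches the client base URL
    port: int = 8000                # default port stated in this step
    max_completed_jobs: int = 100   # assumed default for retained finished jobs
    temp_dir: Optional[str] = None  # assumed field name for the temporary directory

    @property
    def base_url(self) -> str:
        # Address that clients would point at
        return f"http://{self.host}:{self.port}"

    @property
    def docs_url(self) -> str:
        # Swagger UI is served at /docs according to this step
        return f"{self.base_url}/docs"


args = ServerArgsSketch(port=9000)
print(args.base_url)
print(args.docs_url)
```

Check the actual ServerArgs definition in the lmms-eval source for the real field names and defaults before relying on any of these values.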

Step 2: Job Submission

Submit evaluation jobs to the server via the POST /evaluate endpoint. The EvaluateRequest payload specifies the model, model_args, tasks, batch_size, and other evaluation parameters (mirroring the CLI arguments). Jobs are assigned unique IDs and placed in the processing queue. The response includes the job_id and queue position for tracking.

Key considerations:

  • Use the EvalClient Python class for convenient programmatic access
  • Jobs are processed sequentially (one at a time) by the scheduler
  • The same evaluation parameters available on the CLI are supported
  • Both synchronous (EvalClient) and asynchronous (AsyncEvalClient) clients are available
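
For readers calling the endpoint directly rather than through EvalClient, the following sketch builds a POST /evaluate request with the standard library only. The helper build_evaluate_request is hypothetical, and the payload value types (a string for model_args, a list for tasks) and the example model and task names are assumptions; consult the EvaluateRequest schema at /docs for the real shape.

```python
import json
import urllib.request


def build_evaluate_request(base_url, model, model_args, tasks, batch_size=1):
    """Build (but do not send) a POST /evaluate request.

    The field names come from the step above; their value types are assumed.
    """
    payload = {
        "model": model,
        "model_args": model_args,
        "tasks": tasks,
        "batch_size": batch_size,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_evaluate_request(
    "http://localhost:8000",
    model="example_model",                      # placeholder name
    model_args="pretrained=example/checkpoint", # placeholder value
    tasks=["example_task"],                     # placeholder task
)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request, timeout=30) as response:
#     submitted = json.load(response)  # expected to carry the job_id
```

Keeping request construction separate from sending makes the payload easy to inspect or log before a job enters the queue.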

Step 3: Queue Monitoring

Monitor the evaluation queue and job status via the GET /queue and GET /jobs/{job_id} endpoints. The queue status shows all queued, running, completed, and failed jobs. Individual job status includes queue position, execution progress, and results when completed. The EvalClient provides a wait_for_job() method that polls until completion.

Key considerations:

  • Jobs transition through states: queued, running, completed, or failed
  • Queue position updates as preceding jobs complete
  • Failed jobs include error details for debugging
  • The GET /tasks and GET /models endpoints list available evaluation options
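
The polling pattern behind a method like wait_for_job() can be sketched as follows. This is not the EvalClient implementation: poll_until_done is a hypothetical helper, the "status" key is an assumed field name, and fetch_status stands in for whatever performs GET /jobs/{job_id}.

```python
import time

# States after which a job will not change again (from the list above)
TERMINAL_STATES = {"completed", "failed"}


def poll_until_done(fetch_status, job_id, poll_interval=5.0, timeout=3600.0,
                    sleep=time.sleep):
    """Call fetch_status(job_id) repeatedly until the job is completed or failed.

    fetch_status must return a dict with a "status" key (assumed name).
    Raises TimeoutError if the job is still pending once the timeout elapses.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {timeout}s")
        sleep(poll_interval)


# Demonstration with a fake status source instead of a live server
_states = iter(["queued", "running", "completed"])


def fake_fetch(job_id):
    return {"job_id": job_id, "status": next(_states)}


final = poll_until_done(fake_fetch, "job-1", poll_interval=0.0)
print(final["status"])
```

Passing the fetch function in as a parameter keeps the loop testable without a server; in real use it would wrap the HTTP call or a client method.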

Step 4: Results Retrieval

Retrieve evaluation results from completed jobs via the GET /jobs/{job_id} endpoint. Results contain the same structured output as CLI evaluations: per-task metrics, configuration details, and optionally per-sample model outputs. The server retains completed job results up to the configured max_completed_jobs limit. Failed jobs report the error traceback.

Key considerations:

  • Results follow the same format as CLI output (JSON with results, configs, versions)
  • The server can also cancel queued jobs via DELETE /jobs/{job_id}
  • Running jobs cannot be cancelled
  • Results are held in memory only; restarting the server clears the job history
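
The retention and cancellation rules above can be illustrated with a small in-memory model. InMemoryJobStore is a hypothetical sketch, not the actual JobScheduler; it only mirrors the behaviours this page describes: one job runs at a time, only queued jobs can be cancelled, and the oldest finished results are dropped once max_completed_jobs is exceeded.

```python
from collections import OrderedDict, deque


class InMemoryJobStore:
    """Illustrative model of sequential execution with bounded result retention."""

    def __init__(self, max_completed_jobs=2):
        self.max_completed_jobs = max_completed_jobs
        self.queued = deque()          # job ids waiting to run, in order
        self.running = None            # at most one job runs at a time
        self.finished = OrderedDict()  # job_id -> result, oldest first

    def submit(self, job_id):
        self.queued.append(job_id)
        return len(self.queued)        # queue position (1-based is an assumption)

    def start_next(self):
        if self.running is None and self.queued:
            self.running = self.queued.popleft()
        return self.running

    def finish(self, result):
        if self.running is None:
            raise RuntimeError("no job is running")
        self.finished[self.running] = result
        self.running = None
        # Evict the oldest results beyond the retention limit
        while len(self.finished) > self.max_completed_jobs:
            self.finished.popitem(last=False)

    def cancel(self, job_id):
        # Only queued jobs can be cancelled; running jobs cannot
        if job_id in self.queued:
            self.queued.remove(job_id)
            return True
        return False


store = InMemoryJobStore(max_completed_jobs=2)
for name in ["a", "b", "c", "d"]:
    store.submit(name)
store.start_next()                      # "a" starts running
cancelled_running = store.cancel("a")   # refused: already running
cancelled_queued = store.cancel("d")    # accepted: still queued
store.finish({"status": "completed"})
while store.start_next() is not None:
    store.finish({"status": "completed"})
print(list(store.finished))
```

Because everything lives in ordinary Python objects, discarding the store loses all results, which is the same reason a server restart clears the job history.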

Step 5: Client Integration

Integrate evaluation into external systems using the EvalClient or AsyncEvalClient. The client handles HTTP communication, job polling, and error handling. It supports submitting evaluations, checking job status, listing available tasks and models, and waiting for job completion with configurable polling intervals. The async client supports concurrent job management.

Key considerations:

  • EvalClient provides synchronous access with wait_for_job() for blocking patterns
  • AsyncEvalClient provides async/await support for concurrent workflows
  • Base URL defaults to http://localhost:8000; customize for remote servers
  • Timeout configuration prevents hanging on unresponsive servers
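
A concurrent client-side workflow might look like the sketch below, which uses asyncio.gather to submit several jobs and await them together. The submit and wait callables are hypothetical stand-ins for async client methods; the AsyncEvalClient method names and signatures are not assumed here, and the fakes exist only so the example runs without a server.

```python
import asyncio


async def run_jobs_concurrently(submit, wait, requests):
    """Submit every request and wait for all of them to finish.

    submit(request) -> job_id and wait(job_id) -> final job dict are
    caller-supplied coroutines (hypothetical signatures).
    """
    async def run_one(request):
        job_id = await submit(request)
        return await wait(job_id)

    # gather returns results in the same order as the input requests
    return await asyncio.gather(*(run_one(r) for r in requests))


# Fake coroutines standing in for real client calls
async def fake_submit(request):
    await asyncio.sleep(0)
    return f"job-{request['tasks'][0]}"


async def fake_wait(job_id):
    await asyncio.sleep(0)
    return {"job_id": job_id, "status": "completed"}


outcomes = asyncio.run(
    run_jobs_concurrently(
        fake_submit,
        fake_wait,
        [{"tasks": ["task_a"]}, {"tasks": ["task_b"]}],
    )
)
print([o["job_id"] for o in outcomes])
```

Note that concurrency here applies only to the client: the server still processes jobs one at a time, so the benefit is that a single program can track many queued jobs without blocking on each in turn.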
