Workflow:EvolvingLMMs Lab Lmms eval End to End Evaluation

From Leeroopedia
Knowledge Sources
Domains LLMs, Multimodal_Evaluation, Benchmarking
Last Updated 2026-02-14 00:00 GMT

Overview

End-to-end process for evaluating Large Multimodal Models (LMMs) against standardized benchmarks using the lmms-eval CLI, from environment setup through metric computation and results output.

Description

This workflow covers the complete evaluation pipeline for multimodal models. It starts with installing the lmms-eval framework and selecting tasks and models, then proceeds through the core evaluation loop: CLI argument parsing, task discovery and loading from YAML configurations, model instantiation via the registry, request construction from task datasets, model inference dispatch (generation or log-likelihood), output post-processing through configurable filters, metric computation, and finally results aggregation and output. The framework supports 40+ benchmark families spanning image, video, audio, and text understanding.

Usage

Execute this workflow when you need to benchmark a multimodal model against one or more evaluation tasks and produce quantitative metrics. This is the primary use case of the lmms-eval framework and applies whenever you want to measure model performance on established benchmarks like MMMU, MME, SEEDBench, Video-MME, or any of the 197+ supported tasks.

Execution Steps

Step 1: Environment Setup

Install the lmms-eval package and its dependencies. The project uses uv for dependency management. Clone the repository and run uv sync to create a consistent environment from the lockfile. This ensures all developers and CI/CD systems use exactly the same package versions.

Key considerations:

  • Use uv (not pip) for package management
  • GPU drivers and CUDA toolkit must be pre-installed for local model evaluation
  • Some models require additional dependencies (e.g., flash-attn for certain attention implementations)

Step 2: Task Selection

Choose which evaluation benchmarks to run. Tasks are specified as a comma-separated list via the --tasks CLI argument. The TaskManager discovers all available tasks by recursively scanning YAML configuration files in lmms_eval/tasks/. Tasks can be individual benchmarks (e.g., mme), task groups (e.g., mmmu which aggregates subtasks), or custom tasks loaded from external paths via --include_path.

Key considerations:

  • Use --tasks list to discover all available task names
  • Tasks support wildcard matching and grouping
  • Custom YAML task configs can be loaded from directories or file paths
  • Use --limit for quick smoke tests before full-scale evaluation
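
The wildcard behaviour mentioned above can be pictured with a small sketch. It assumes glob-style matching (as in Python's fnmatch) and uses an illustrative list of task names; the actual TaskManager may resolve names and groups differently.

```python
from fnmatch import fnmatchcase


def expand_task_patterns(patterns, available):
    """Return the available task names matched by any pattern, sorted."""
    selected = set()
    for pattern in patterns:
        matches = [name for name in available if fnmatchcase(name, pattern)]
        if not matches:
            # Fail early rather than silently evaluating nothing.
            raise ValueError(f"no task matches {pattern!r}")
        selected.update(matches)
    return sorted(selected)


# Illustrative names only; the real list comes from `--tasks list`.
available_tasks = ["mme", "mmmu_val", "mmmu_test", "seedbench", "videomme"]

# The --tasks value is a comma-separated string.
requested = "mme,mmmu_*".split(",")
print(expand_task_patterns(requested, available_tasks))
# ['mme', 'mmmu_test', 'mmmu_val']
```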

Step 3: Model Configuration

Select and configure the model to evaluate. The --model argument specifies the model type (e.g., qwen2_5_vl, llava, openai_compatible), and --model_args provides model-specific parameters as a comma-separated key=value string (e.g., pretrained=Qwen/Qwen2.5-VL-3B-Instruct). The ModelRegistryV2 resolves the model name to a concrete class, supporting both chat-template-based and simple (legacy) model interfaces.

Key considerations:

  • Model types include local HuggingFace models, API-based models (OpenAI, Claude, Gemini), and inference servers (vLLM, SGLang)
  • Set --batch_size to control throughput (1 recommended for final benchmarking runs)
  • Use --device to specify GPU placement (e.g., cuda:0)
  • The --model_args string is parsed and passed to the model constructor
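
To make the key=value format concrete, here is a minimal sketch of how such a string could be turned into constructor keyword arguments. The function name parse_model_args, the coercion rules, and the keys max_frames and use_cache are assumptions for illustration; the framework's own parser may behave differently.

```python
def parse_model_args(arg_string):
    """Split 'k1=v1,k2=v2' into a dict, coercing simple literals."""

    def coerce(value):
        lowered = value.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                continue
        return value  # leave everything else as a string

    result = {}
    for pair in arg_string.split(","):
        pair = pair.strip()
        if not pair:
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = coerce(value.strip())
    return result


# 'pretrained' comes from the text above; the other keys are hypothetical.
kwargs = parse_model_args(
    "pretrained=Qwen/Qwen2.5-VL-3B-Instruct,max_frames=8,use_cache=false"
)
print(kwargs)
```

The resulting dictionary would then be unpacked into the model constructor, for example SomeModel(**kwargs), where SomeModel stands for whichever class the registry resolves.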

Step 4: Request Construction

The evaluator builds Instance requests from each task's dataset. For each document in the evaluation split, the task constructs prompts using its doc_to_text/doc_to_visual (simple models) or doc_to_messages (chat models) functions, applies few-shot examples if configured, and creates Instance objects tagged with the request type (generate_until, loglikelihood, or multiple_choice). In distributed settings, documents are sharded across ranks.

Key considerations:

  • Request construction is parallelized across distributed ranks
  • Few-shot examples are sampled from the fewshot_split or training_split
  • Chat templates are applied when --apply_chat_template is set
  • Request caching (--cache_requests) can speed up repeated evaluations
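
Sharding documents across ranks can be illustrated with a strided slice. This is a sketch under the assumption of round-robin assignment; the helper name shard is hypothetical and the framework may partition documents another way.

```python
from itertools import islice


def shard(docs, rank, world_size):
    """Give each rank every world_size-th document, starting at its rank."""
    return list(islice(docs, rank, None, world_size))


docs = [f"doc_{i}" for i in range(10)]
world_size = 4
shards = [shard(docs, rank, world_size) for rank in range(world_size)]

for rank, part in enumerate(shards):
    print(rank, part)
# 0 ['doc_0', 'doc_4', 'doc_8']
# 1 ['doc_1', 'doc_5', 'doc_9']
# 2 ['doc_2', 'doc_6']
# 3 ['doc_3', 'doc_7']
```

Note that the shards are not all the same length when the document count is not a multiple of the number of ranks.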

Step 5: Model Inference

Execute the model on all constructed requests. Requests are grouped by type and dispatched to the model's generate_until() or loglikelihood() methods. For generation tasks, the model produces text responses given multimodal prompts. For multiple-choice tasks, the model computes log-probabilities for each candidate answer. Responses are collected and associated back with their originating requests.

Key considerations:

  • Batch size affects throughput and may affect some models' output consistency
  • In multi-GPU mode, requests are padded to ensure equal work across ranks
  • Generation kwargs (temperature, max_new_tokens, top_p) can be overridden via --gen_kwargs
  • Model VRAM is freed after inference completes to allow LLM-as-judge evaluation
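
The grouping and dispatch described in this step can be sketched as follows. Request and EchoModel are hypothetical stand-ins, not the framework's Instance or model classes; only the method names generate_until and loglikelihood are taken from the text.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for a request object."""
    request_type: str
    prompt: str
    responses: list = field(default_factory=list)


class EchoModel:
    """Toy model exposing the two inference methods named in the text."""

    def generate_until(self, requests):
        return [f"answer to: {r.prompt}" for r in requests]

    def loglikelihood(self, requests):
        # Fake (log-probability, is_greedy) pairs.
        return [(-float(len(r.prompt)), False) for r in requests]


def run_inference(model, requests):
    # Group requests by type so each model method is called once per type.
    grouped = defaultdict(list)
    for req in requests:
        grouped[req.request_type].append(req)

    for request_type, group in grouped.items():
        # Raises AttributeError if the model lacks a method for this type.
        method = getattr(model, request_type)
        outputs = method(group)
        # Attach each output to the request that produced it.
        for req, out in zip(group, outputs):
            req.responses.append(out)
    return requests


reqs = [
    Request("generate_until", "describe the image"),
    Request("loglikelihood", "A"),
    Request("generate_until", "count the cars"),
]
run_inference(EchoModel(), reqs)
```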

Step 6: Post-processing and Metrics

Apply output filters and compute evaluation metrics. Each task defines a process_results function that parses model outputs and computes per-sample scores. Configurable filter pipelines (regex extraction, selection, transformation) clean model outputs before scoring. Metrics are aggregated across all samples using task-specific aggregation functions (mean, custom aggregators). In distributed mode, per-rank results are gathered to rank 0 for final aggregation. Bootstrap resampling is used to quantify the uncertainty of each aggregated metric.
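
As an illustration of filtering followed by per-sample scoring, the sketch below extracts a multiple-choice letter with a regular expression and aggregates with a mean. The pattern, the fallback token, and the helper names are assumptions for this example and are not taken from any task's actual configuration.

```python
import re
import statistics

# Matches phrasings such as "The answer is (B)." or "Answer: C".
ANSWER_PATTERN = re.compile(r"[Aa]nswer(?: is)?:?\s*\(?([A-D])\)?")


def extract_choice(text, fallback="[invalid]"):
    """Regex filter: pull the chosen letter out of a free-form response."""
    match = ANSWER_PATTERN.search(text)
    return match.group(1) if match else fallback


def score_sample(target, filtered_output):
    """Per-sample score: 1.0 for an exact match, otherwise 0.0."""
    return 1.0 if filtered_output == target else 0.0


outputs = ["The answer is (B).", "Answer: C", "I am not sure."]
targets = ["B", "D", "A"]

filtered = [extract_choice(o) for o in outputs]
scores = [score_sample(t, f) for t, f in zip(targets, filtered)]
accuracy = statistics.fmean(scores)  # mean aggregation over samples
print(filtered, scores)
# ['B', 'C', '[invalid]'] [1.0, 0.0, 0.0]
```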

Key considerations:

  • Some tasks use GPT-4 as a judge for open-ended evaluation (e.g., video captioning)
  • Custom metrics are defined per-task in their utils.py files
  • The --predict_only flag skips metric computation for output-only runs
  • Results include standard errors computed via bootstrap resampling
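
A bootstrap standard error for a mean metric can be estimated by resampling the per-sample scores with replacement. The following is a generic sketch; the iteration count, seed handling, and function name are assumptions and do not reflect the framework's defaults.

```python
import random
import statistics


def bootstrap_stderr(scores, iters=1000, seed=0):
    """Estimate the standard error of the mean via bootstrap resampling."""
    rng = random.Random(seed)
    n = len(scores)
    # Each iteration draws n samples with replacement and records the mean.
    means = [statistics.fmean(rng.choices(scores, k=n)) for _ in range(iters)]
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(means)


# 60 correct and 40 incorrect samples, i.e. an accuracy of 0.6.
scores = [1.0] * 60 + [0.0] * 40
stderr = bootstrap_stderr(scores)
print(f"accuracy={statistics.fmean(scores):.2f} stderr={stderr:.3f}")
```

For this input the estimate should land near the analytic value sqrt(0.6 * 0.4 / 100), which is roughly 0.049.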

Step 7: Results Output

Format and save evaluation results. The EvaluationTracker saves aggregated results as JSON files, optionally pushes to HuggingFace Hub, and generates metadata cards. A formatted results table is printed to stdout showing per-task metrics. When --log_samples is set, all per-sample model outputs are saved for post-hoc analysis. Weights & Biases integration (--wandb_args) provides experiment tracking with visualization.

Key considerations:

  • Use --output_path to specify where results are saved
  • --log_samples captures all model outputs for debugging and analysis
  • Results can be pushed to HuggingFace Hub for sharing
  • W&B logging provides interactive dashboards and experiment comparison
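
Putting the steps together, a run can be assembled from the flags named in this workflow. The sketch below only builds and prints the command. It assumes the package can be launched as a module with python -m lmms_eval, and the output directory ./logs is a placeholder; verify both against your installation.

```python
import shlex
import sys


def build_command(model, model_args, tasks, output_path,
                  batch_size=1, limit=None, log_samples=False):
    """Assemble an lmms-eval invocation from the flags described above."""
    cmd = [
        sys.executable, "-m", "lmms_eval",  # assumed module entry point
        "--model", model,
        "--model_args", ",".join(f"{k}={v}" for k, v in model_args.items()),
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--output_path", output_path,
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]  # small value for a smoke test
    if log_samples:
        cmd.append("--log_samples")
    return cmd


cmd = build_command(
    model="qwen2_5_vl",
    model_args={"pretrained": "Qwen/Qwen2.5-VL-3B-Instruct"},
    tasks=["mme", "mmmu"],
    output_path="./logs",  # placeholder directory
    limit=8,
    log_samples=True,
)
print(shlex.join(cmd))
# To launch it: import subprocess; subprocess.run(cmd, check=True)
```

Dropping the limit argument turns the smoke test into a full evaluation run.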
