Workflow:EvolvingLMMs Lab Lmms eval End to End Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Multimodal_Evaluation, Benchmarking |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for evaluating Large Multimodal Models (LMMs) against standardized benchmarks using the lmms-eval CLI, from environment setup through metric computation and results output.
Description
This workflow covers the complete evaluation pipeline for multimodal models. It starts with installing the lmms-eval framework and selecting tasks and models, then proceeds through the core evaluation loop: CLI argument parsing, task discovery and loading from YAML configurations, model instantiation via the registry, request construction from task datasets, model inference dispatch (generation or log-likelihood), output post-processing through configurable filters, metric computation, and finally results aggregation and output. The framework supports 40+ benchmark families spanning image, video, audio, and text understanding.
Usage
Execute this workflow when you need to benchmark a multimodal model against one or more evaluation tasks and produce quantitative metrics. This is the primary use case of the lmms-eval framework and applies whenever you want to measure model performance on established benchmarks like MMMU, MME, SEEDBench, Video-MME, or any of the 197+ supported tasks.
Execution Steps
Step 1: Environment Setup
Install the lmms-eval package and its dependencies. The project uses uv for dependency management. Clone the repository and run uv sync to create a consistent environment from the lockfile. This ensures all developers and CI/CD systems use exactly the same package versions.
Key considerations:
- Use uv (not pip) for package management
- GPU drivers and CUDA toolkit must be pre-installed for local model evaluation
- Some models require additional dependencies (e.g., flash-attn for certain attention implementations)
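A small preflight check can confirm the prerequisites above before running uv sync. This is only a sketch and not part of lmms-eval: the function name preflight is hypothetical, and looking for nvidia-smi on PATH is a rough proxy for a working GPU driver, not a full CUDA check.

```python
import shutil


def preflight() -> dict[str, bool]:
    """Report whether the tools this workflow relies on are on PATH."""
    return {
        "uv": shutil.which("uv") is not None,
        # Assumption: nvidia-smi being present implies an NVIDIA driver is installed.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for tool, found in preflight().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```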
Step 2: Task Selection
Choose which evaluation benchmarks to run. Tasks are specified as a comma-separated list via the --tasks CLI argument. The TaskManager discovers all available tasks by recursively scanning YAML configuration files in lmms_eval/tasks/. Tasks can be individual benchmarks (e.g., mme), task groups (e.g., mmmu which aggregates subtasks), or custom tasks loaded from external paths via --include_path.
Key considerations:
- Use --tasks list to discover all available task names
- Tasks support wildcard matching and grouping
- Custom YAML task configs can be loaded from directories or file paths
- Use --limit for quick smoke tests before full-scale evaluation
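The discovery and wildcard behavior described above can be illustrated with a minimal sketch. It is not the TaskManager implementation: discover_tasks and select_tasks are hypothetical names, and the scan reads only a top-level `task: <name>` line rather than parsing full YAML, so grouped or templated configs are ignored here.

```python
import fnmatch
from pathlib import Path


def discover_tasks(root: Path) -> dict[str, Path]:
    """Map task name -> YAML path by scanning root recursively.

    Simplification: only a top-level `task: <name>` line is read.
    """
    tasks: dict[str, Path] = {}
    for path in sorted(root.rglob("*.yaml")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("task:"):
                name = line.split(":", 1)[1].strip().strip("'\"")
                if name:
                    tasks[name] = path
                break
    return tasks


def select_tasks(requested: str, available: list[str]) -> list[str]:
    """Expand a comma-separated --tasks value, honoring shell-style wildcards."""
    selected: list[str] = []
    for pattern in requested.split(","):
        pattern = pattern.strip()
        for name in fnmatch.filter(sorted(available), pattern):
            if name not in selected:
                selected.append(name)
    return selected
```

Under these assumptions, select_tasks("mmmu_*,mme", names) keeps every name starting with mmmu_ plus the exact name mme, without duplicates.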
Step 3: Model Configuration
Select and configure the model to evaluate. The --model argument specifies the model type (e.g., qwen2_5_vl, llava, openai_compatible), and --model_args provides model-specific parameters as a comma-separated key=value string (e.g., pretrained=Qwen/Qwen2.5-VL-3B-Instruct). The ModelRegistryV2 resolves the model name to a concrete class, supporting both chat-template-based and simple (legacy) model interfaces.
Key considerations:
- Model types include local HuggingFace models, API-based models (OpenAI, Claude, Gemini), and inference servers (vLLM, SGLang)
- Set --batch_size to control throughput (a value of 1 is recommended for final benchmarking runs, since batching can affect some models' output consistency)
- Use --device to specify GPU placement (e.g., cuda:0)
- The --model_args string is parsed and passed to the model constructor
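To make the key=value format concrete, the sketch below parses a --model_args string and assembles a command line from the flags named in this step. Both function names are hypothetical, the `python -m lmms_eval` entry point is an assumption, and values are kept as plain strings here, whereas the real parser may coerce types.

```python
import shlex


def parse_model_args(arg_string: str) -> dict[str, str]:
    """Split 'k1=v1,k2=v2' into a dict; values stay strings in this sketch."""
    args: dict[str, str] = {}
    for pair in arg_string.split(","):
        if not pair.strip():
            continue
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        args[key.strip()] = value.strip()
    return args


def build_command(model: str, model_args: str, tasks: str, **flags: object) -> list[str]:
    """Assemble an argv list; the `python -m lmms_eval` entry point is assumed."""
    cmd = ["python", "-m", "lmms_eval", "--model", model,
           "--model_args", model_args, "--tasks", tasks]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "qwen2_5_vl", "pretrained=Qwen/Qwen2.5-VL-3B-Instruct", "mme",
        batch_size=1, device="cuda:0", limit=8,
    )
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids shell-quoting problems when it is later passed to subprocess.run.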
Step 4: Request Construction
The evaluator builds Instance requests from each task's dataset. For each document in the evaluation split, the task constructs prompts using its doc_to_text/doc_to_visual (simple models) or doc_to_messages (chat models) functions, applies few-shot examples if configured, and creates Instance objects tagged with the request type (generate_until, loglikelihood, or multiple_choice). In distributed settings, documents are sharded across ranks.
Key considerations:
- Request construction is parallelized across distributed ranks
- Few-shot examples are sampled from the fewshot_split or training_split
- Chat templates are applied when --apply_chat_template is set
- Request caching (--cache_requests) can speed up repeated evaluations
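The construction and sharding described above might look roughly like the following. The Instance dataclass is a hypothetical stand-in for the framework's class, and the round-robin split is an assumption; the text only states that documents are sharded across ranks.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Any


@dataclass
class Instance:  # hypothetical stand-in, not the lmms-eval class
    request_type: str
    doc_id: int
    arguments: tuple[Any, ...]


def build_requests(docs, doc_to_text, request_type, rank=0, world_size=1):
    """Build one Instance per document assigned to this rank (round-robin shard)."""
    indexed = enumerate(docs)
    # Rank r takes documents r, r + world_size, r + 2 * world_size, ...
    shard = islice(indexed, rank, None, world_size)
    return [Instance(request_type, doc_id, (doc_to_text(doc),)) for doc_id, doc in shard]
```

Keeping doc_id on each Instance is what lets responses be matched back to their source documents after inference.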
Step 5: Model Inference
Execute the model on all constructed requests. Requests are grouped by type and dispatched to the model's generate_until() or loglikelihood() methods. For generation tasks, the model produces text responses given multimodal prompts. For multiple-choice tasks, the model computes log-probabilities for each candidate answer. Responses are collected and associated back with their originating requests.
Key considerations:
- Batch size affects throughput and may affect some models' output consistency
- In multi-GPU mode, requests are padded to ensure equal work across ranks
- Generation kwargs (temperature, max_new_tokens, top_p) can be overridden via --gen_kwargs
- Model VRAM is freed after inference completes to allow LLM-as-judge evaluation
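A simplified view of the dispatch step: group requests by type, call the method of the same name, and write each response back to its original position. EchoModel and the dict-based request shape are hypothetical; only the method names generate_until and loglikelihood come from the text, and the scores are placeholders rather than real log-probabilities.

```python
from collections import defaultdict


class EchoModel:  # hypothetical model exposing the two method names from the text
    def generate_until(self, requests):
        return [f"answer to {r['prompt']}" for r in requests]

    def loglikelihood(self, requests):
        # Placeholder score: the negative length of the continuation.
        return [-float(len(r["continuation"])) for r in requests]


def run_inference(model, requests):
    """Group requests by type, call the matching method, and restore input order."""
    grouped = defaultdict(list)
    for index, request in enumerate(requests):
        grouped[request["type"]].append((index, request))

    responses = [None] * len(requests)
    for request_type, items in grouped.items():
        method = getattr(model, request_type)
        outputs = method([request for _, request in items])
        for (index, _), output in zip(items, outputs):
            responses[index] = output
    return responses
```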
Step 6: Post-processing and Metrics
Apply output filters and compute evaluation metrics. Configurable filter pipelines (regex extraction, selection, transformation) first clean the raw model outputs. Each task's process_results function then parses the filtered outputs and computes per-sample scores, which are aggregated across all samples using task-specific aggregation functions (mean or custom aggregators). In distributed mode, per-rank results are gathered to rank 0 for final aggregation. Bootstrap resampling is used to estimate the uncertainty of the aggregated metrics.
Key considerations:
- Some tasks use GPT-4 as a judge for open-ended evaluation (e.g., video captioning)
- Custom metrics are defined per-task in their utils.py files
- The --predict_only flag skips metric computation for output-only runs
- Results include standard errors computed via bootstrap resampling
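The filter, score, aggregate, and bootstrap sequence can be sketched on a toy multiple-choice task. All function names are hypothetical, the regex is a deliberately naive extractor (it would also match a leading article "A"), and the bootstrap shown is the plain resample-the-mean variant, which may differ from what the framework does.

```python
import random
import re
import statistics


def extract_choice(response: str) -> str:
    """Regex filter: keep the first standalone letter A-D, else an empty string."""
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else ""


def bootstrap_stderr(scores, iters=1000, seed=0):
    """Standard deviation of resampled means, used as a standard-error estimate."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(scores, k=len(scores))) for _ in range(iters)]
    return statistics.pstdev(means)


def evaluate(responses, targets):
    """Score each filtered response by exact match, then aggregate with the mean."""
    scores = [float(extract_choice(r) == t) for r, t in zip(responses, targets)]
    return {"acc": statistics.fmean(scores), "acc_stderr": bootstrap_stderr(scores)}
```

Fixing the seed makes the standard-error estimate reproducible across runs, which helps when comparing results files.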
Step 7: Results Output
Format and save evaluation results. The EvaluationTracker saves aggregated results as JSON files, optionally pushes to HuggingFace Hub, and generates metadata cards. A formatted results table is printed to stdout showing per-task metrics. When --log_samples is set, all per-sample model outputs are saved for post-hoc analysis. Weights & Biases integration (--wandb_args) provides experiment tracking with visualization.
Key considerations:
- Use --output_path to specify where results are saved
- --log_samples captures all model outputs for debugging and analysis
- Results can be pushed to HuggingFace Hub for sharing
- W&B logging provides interactive dashboards and experiment comparison
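As a rough illustration of the output step, the sketch below writes an aggregated results dictionary to JSON and renders a per-task table. It does not reproduce the EvaluationTracker: the function names, the results.json file name, and the table layout are all hypothetical.

```python
import json
from pathlib import Path


def save_results(results: dict, output_path: str, filename: str = "results.json") -> Path:
    """Write aggregated results as JSON; the file name here is hypothetical."""
    out_dir = Path(output_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / filename
    target.write_text(json.dumps(results, indent=2, sort_keys=True), encoding="utf-8")
    return target


def format_table(results: dict) -> str:
    """Render one row per (task, metric) pair as a pipe-delimited table."""
    rows = ["| Task | Metric | Value |", "|---|---|---|"]
    for task, metrics in sorted(results.items()):
        for metric, value in sorted(metrics.items()):
            rows.append(f"| {task} | {metric} | {value:.4f} |")
    return "\n".join(rows)
```

Sorting keys in both the JSON and the table keeps outputs stable between runs, so two results files can be compared with a plain text diff.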