Workflow:Open compass VLMEvalKit Image Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | VLM_Evaluation, Benchmarking, Computer_Vision |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
End-to-end process for evaluating local Vision-Language Models (VLMs) on image-based benchmarks using VLMEvalKit's unified evaluation pipeline.
Description
This workflow covers the standard procedure for running a VLM through one or more image benchmarks and obtaining quantitative evaluation results. It starts with installing VLMEvalKit and configuring API keys (for LLM-based answer extraction), then proceeds through model selection, benchmark selection, distributed inference, automated evaluation, and result analysis. The toolkit handles data downloading, prompt construction, prediction generation, answer extraction, and metric calculation automatically. Over 70 image benchmarks are supported, including MCQ (MMBench, MMMU, MMStar), VQA (TextVQA, ChartQA, DocVQA), hallucination (POPE, HallusionBench), math (MathVista, MathVision), and OCR (OCRBench) tasks.
Usage
Execute this workflow when you need to evaluate a locally-running Vision-Language Model on standardized image benchmarks. You should have a GPU-equipped machine with enough VRAM for the target model, the model weights accessible (via HuggingFace or local path), and optionally an OpenAI API key for LLM-based answer extraction on open-ended tasks.
Execution Steps
Step 1: Installation and Environment Setup
Clone the VLMEvalKit repository and install it as an editable Python package. This registers the vlmeval package and the vlmutil CLI tool. Then configure API keys for any judge LLMs needed during evaluation (e.g., OpenAI key for GPT-based answer extraction). Keys are placed in a .env file at the repository root or set as environment variables.
Key considerations:
- Different VLMs require specific transformers library versions (e.g., 4.33.0 for the Qwen series, 4.37.0 for the LLaVA series), so the environment may need to be rebuilt when switching model families
- The OpenAI API key is optional but recommended: without it, evaluation falls back to exact matching, which works only for MCQ and Yes/No tasks
- Set VLMEVALKIT_USE_MODELSCOPE=1 to download datasets from ModelScope instead of HuggingFace
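To make the key-file format concrete, here is a minimal sketch of how KEY=VALUE lines from a .env file can be turned into environment variables. VLMEvalKit ships its own loader; the function name load_env_file and its behavior (existing variables take precedence) are assumptions made for illustration, and the key value is a placeholder.

```python
import os


def load_env_file(text, environ=None):
    """Parse KEY=VALUE lines and copy them into an environment mapping.

    Hypothetical helper for illustration; not VLMEvalKit's own loader.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        if key:
            # Assumption: variables that are already set take precedence.
            environ.setdefault(key, value)
            loaded[key] = environ[key]
    return loaded


example = """
# Judge LLM credentials (placeholder value, not a real key)
OPENAI_API_KEY=sk-placeholder
VLMEVALKIT_USE_MODELSCOPE=1
"""

env = {}  # pass a plain dict here to avoid touching the real environment
print(load_env_file(example, env))
```

Passing a plain dict instead of os.environ keeps the sketch side-effect free, which is convenient when checking a key file before a long run.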
Step 2: Model Selection and Validation
Choose a VLM from the supported model registry defined in vlmeval/config.py. The registry maps model name strings to constructor functions via functools.partial(). Validate that the model loads correctly by running the vlmutil check command, which instantiates the model and runs a quick inference test.
Key considerations:
- Use vlmutil mlist all to list all supported models
- Models are grouped by required transformers version
- The model name used here must exactly match the key in the supported_VLM dictionary
- For very large models, ensure sufficient GPU VRAM or use model parallelism
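The registry pattern described above can be sketched in a few lines. ToyVLM, its parameters, the model names, and build_model are hypothetical stand-ins; only the idea of a supported_VLM dictionary holding functools.partial constructors comes from the text.

```python
from functools import partial


class ToyVLM:
    """Hypothetical stand-in for a real model wrapper class."""

    def __init__(self, model_path, max_new_tokens=512):
        self.model_path = model_path
        self.max_new_tokens = max_new_tokens

    def generate(self, message):
        return f"[{self.model_path}] reply to {len(message)} message parts"


# Name -> zero-argument constructor. Nothing is instantiated until the
# partial is called, so listing the registry stays cheap.
supported_VLM = {
    "toy-7b": partial(ToyVLM, model_path="org/toy-7b"),
    "toy-7b-short": partial(ToyVLM, model_path="org/toy-7b", max_new_tokens=64),
}


def build_model(name):
    """Look up a model by its exact registry key and instantiate it."""
    if name not in supported_VLM:
        raise KeyError(f"Unknown model {name!r}; choose from {sorted(supported_VLM)}")
    return supported_VLM[name]()


model = build_model("toy-7b-short")
print(model.max_new_tokens)
```

Because the lookup is a plain dictionary access, a name that differs only in case or punctuation fails, which is why the model name must match the registry key exactly.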
Step 3: Benchmark Selection
Select one or more image benchmarks to evaluate against. Benchmarks are organized by type (MCQ, VQA, Y/N, Caption) and registered in vlmeval/dataset/__init__.py. Each benchmark has a TSV data file that is automatically downloaded on first use.
Key considerations:
- Use vlmutil dlist all to list all supported datasets
- Benchmarks are categorized into levels (l1, l2, l3) by importance
- Some benchmarks require specific judge models (e.g., MMVet uses GPT-4-turbo)
- Test splits (e.g., MMMU_TEST, DocVQA_TEST) generate submission files instead of evaluation scores
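As a rough illustration of what a benchmark TSV holds, the sketch below parses a tiny in-memory table with the standard csv module. The column names (index, question, option letters, answer, category) and the sample rows are assumptions for illustration; real benchmark files also carry image data and may use different fields.

```python
import csv
import io

# Assumed layout: one row per sample, one column per option letter.
SAMPLE_TSV = (
    "index\tquestion\tA\tB\tanswer\tcategory\n"
    "0\tWhat color is the sky?\tBlue\tGreen\tA\tperception\n"
    "1\tHow many legs does a cat have?\tTwo\tFour\tB\treasoning\n"
)


def load_tsv(handle):
    """Read a tab-separated benchmark table into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def option_letters(row):
    """Return the single-letter option columns that are filled in for a row."""
    return [key for key in row if len(key) == 1 and key.isupper() and row[key]]


rows = load_tsv(io.StringIO(SAMPLE_TSV))
for row in rows:
    print(row["index"], row["question"], option_letters(row))
```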
Step 4: Run Inference
Launch the evaluation pipeline via run.py. The inference engine loads the model, iterates over benchmark samples, builds prompts (using dataset-level or model-level prompt construction), generates predictions, and saves results to checkpoint files. For multi-GPU setups, use torchrun to run multiple model instances in data-parallel mode, where each rank processes a shard of the data.
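A small helper can make the single-GPU and multi-GPU launch forms explicit. This is a sketch: build_command is a hypothetical wrapper, and it assumes run.py accepts --data, --model, --verbose, and --reuse and is launched from the repository root. The model and dataset names are examples only; check them against your installed version.

```python
import shlex


def build_command(models, datasets, num_gpus=1, reuse=False, verbose=True):
    """Assemble the argv list for a run.py launch (hypothetical helper)."""
    if num_gpus > 1:
        # One model instance per GPU, each handling a shard of the data.
        cmd = ["torchrun", f"--nproc-per-node={num_gpus}", "run.py"]
    else:
        cmd = ["python", "run.py"]
    cmd += ["--data", *datasets, "--model", *models]
    if verbose:
        cmd.append("--verbose")
    if reuse:
        cmd.append("--reuse")
    return cmd


cmd = build_command(["qwen_chat"], ["MMBench_DEV_EN"], num_gpus=8)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the names have been verified.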
What happens:
- The dataset is loaded and split across ranks for distributed processing
- Each rank builds prompts via dataset.build_prompt() or model.build_prompt()
- The model generates predictions via model.generate()
- Intermediate results are saved as pickle checkpoint files
- Results from all ranks are merged by rank 0
- If the SPLIT_THINK environment variable is set to True, thinking/reasoning content is separated from the final answer
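The sharding and merging steps listed above can be sketched without any GPU. This is an illustration under stated assumptions: the strided split (rank, rank + world_size, ...) and the helper names are hypothetical, the ranks are simulated in a loop rather than run as separate processes, and str.upper stands in for model.generate().

```python
def shard_indices(num_samples, rank, world_size):
    """Indices handled by one rank under an assumed strided split."""
    return list(range(rank, num_samples, world_size))


def run_rank(samples, rank, world_size, generate):
    """Produce {sample index: prediction} for this rank's shard only."""
    return {
        i: generate(samples[i])
        for i in shard_indices(len(samples), rank, world_size)
    }


def merge(per_rank_results, num_samples):
    """Combine per-rank results in dataset order (the job rank 0 performs)."""
    merged = {}
    for part in per_rank_results:
        merged.update(part)
    missing = [i for i in range(num_samples) if i not in merged]
    if missing:
        raise ValueError(f"missing predictions for indices {missing}")
    return [merged[i] for i in range(num_samples)]


samples = [f"question {i}" for i in range(7)]
world_size = 3
# Simulate the ranks sequentially; str.upper is a placeholder for the model.
parts = [run_rank(samples, r, world_size, str.upper) for r in range(world_size)]
predictions = merge(parts, len(samples))
print(predictions)
```

Keying each prediction by its sample index is what lets a rerun skip finished samples and lets the merge detect gaps left by a crashed rank.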
Step 5: Run Evaluation
After inference completes, rank 0 runs the evaluation pipeline. Each dataset class implements an evaluate() method that loads predictions, applies post-processing (answer extraction, option matching), optionally invokes a judge LLM for open-ended tasks, computes metrics (accuracy, F1, BLEU, etc.), and writes score files.
What happens:
- Predictions are loaded from the result file
- For MCQ tasks, answers are extracted using pattern matching or LLM-based extraction
- For open-ended tasks, a judge model (e.g., GPT-4) scores predictions against ground truth
- Metrics are computed per category and overall
- Score CSV/JSON files are written to the working directory
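For MCQ tasks, the pattern-matching and per-category scoring steps might look like the sketch below. The regular expression, the record fields, and the rule of treating a prediction with more than one candidate letter as unresolved are assumptions for illustration; the toolkit's own matcher is more elaborate and can fall back to an LLM for such cases.

```python
import re
from collections import defaultdict


def extract_choice(prediction, letters="ABCD"):
    """Return the option letter if exactly one stands alone in the text."""
    found = set(re.findall(rf"\b([{letters}])\b", prediction))
    # Zero or several candidates: leave unresolved rather than guess.
    return found.pop() if len(found) == 1 else None


def accuracy_by_category(records):
    """Compute accuracy per category plus an 'Overall' entry."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        choice = extract_choice(rec["prediction"])
        for key in (rec["category"], "Overall"):
            totals[key] += 1
            hits[key] += int(choice == rec["answer"])
    return {key: hits[key] / totals[key] for key in totals}


records = [
    {"category": "perception", "answer": "A", "prediction": "A"},
    {"category": "perception", "answer": "C", "prediction": "The answer is (C)."},
    {"category": "reasoning", "answer": "B", "prediction": "Probably D."},
    {"category": "reasoning", "answer": "B", "prediction": "Either A or B"},
]
print(accuracy_by_category(records))
```

The last record shows why exact matching alone is limited: the text mentions two letters, so the sketch counts it as wrong instead of guessing.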
Step 6: Review Results
Examine the evaluation output files in the working directory. Results are organized as {model_name}_{dataset_name}_*.csv files containing per-category and overall metrics. The scripts/summarize.py script can aggregate scores across multiple benchmarks into a single summary table. The vlmutil scan command can detect API failures in results.
Key considerations:
- Check {model}_{dataset}.xlsx for per-sample predictions and extracted answers
- Check {model}_{dataset}_{judge}.xlsx for judge evaluation details
- Use the --reuse flag to skip already-completed inference on re-runs
- Performance may vary across environments due to library version differences
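When many benchmarks have been run, a short script can gather the score files for one model. The sketch below assumes score files are CSVs named {model}_{dataset}_*.csv somewhere under the working directory; collect_scores, the demo file name, and its columns are hypothetical, and the actual file layout may differ between versions.

```python
import csv
import tempfile
from pathlib import Path


def collect_scores(work_dir, model):
    """Map each score-file name to its parsed rows for one model."""
    tables = {}
    # Search recursively, since outputs may sit in per-model subfolders.
    for path in sorted(Path(work_dir).rglob(f"{model}_*.csv")):
        with path.open(newline="") as handle:
            tables[path.name] = list(csv.DictReader(handle))
    return tables


# Demo on a throwaway directory with one made-up score file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "toy-7b_MMBench_DEV_EN_acc.csv"
    demo.write_text("split,Overall,perception\ndev,0.5,1.0\n")
    scores = collect_scores(tmp, "toy-7b")

print(scores)
```

Values are returned as strings, as read from the CSV; convert them with float() before averaging or ranking.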