Workflow: OpenCompass VLMEvalKit Image Benchmark Evaluation

Knowledge Sources	VLMEvalKit Quickstart Guide, Config System, VLMEvalKit Technical Report
Domains	VLM_Evaluation, Benchmarking, Computer_Vision
Last Updated	2026-02-14 00:00 GMT

Overview

End-to-end process for evaluating local Vision-Language Models (VLMs) on image-based benchmarks using VLMEvalKit's unified evaluation pipeline.

Description

This workflow covers the standard procedure for running a VLM through one or more image benchmarks and obtaining quantitative evaluation results. It starts with installing VLMEvalKit and configuring API keys (for LLM-based answer extraction), then proceeds through model selection, benchmark selection, distributed inference, automated evaluation, and result analysis. The toolkit handles data downloading, prompt construction, prediction generation, answer extraction, and metric calculation automatically. Over 70 image benchmarks are supported, including MCQ (MMBench, MMMU, MMStar), VQA (TextVQA, ChartQA, DocVQA), hallucination (POPE, HallusionBench), math (MathVista, MathVision), and OCR (OCRBench) tasks.

Usage

Execute this workflow when you need to evaluate a locally-running Vision-Language Model on standardized image benchmarks. You should have a GPU-equipped machine with enough VRAM for the target model, the model weights accessible (via HuggingFace or local path), and optionally an OpenAI API key for LLM-based answer extraction on open-ended tasks.

Execution Steps

Step 1: Installation and Environment Setup

Clone the VLMEvalKit repository and install it as an editable Python package. This registers the vlmeval package and the vlmutil CLI tool. Then configure API keys for any judge LLMs needed during evaluation (e.g., an OpenAI key for GPT-based answer extraction). Place the keys in a .env file at the repository root, or set them as environment variables.

Key considerations:

Different VLMs require specific transformers library versions (e.g., 4.33.0 for Qwen, 4.37.0 for LLaVA)
The OpenAI API key is optional but recommended: without it, evaluation falls back to exact matching, which works only for MCQ and Yes/No tasks
Set VLMEVALKIT_USE_MODELSCOPE=1 to download datasets from ModelScope instead of HuggingFace
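
As a rough illustration of the .env convention described above, the sketch below parses KEY=VALUE lines and copies them into the process environment without overriding values that are already set. This is not VLMEvalKit's own loader; the function names load_env_file and apply_env are hypothetical, and OPENAI_API_KEY is assumed here as the name of the judge key.

```python
import os
import tempfile
from pathlib import Path


def load_env_file(path):
    """Parse KEY=VALUE lines from a .env-style file into a dict (hypothetical helper)."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values, environ=os.environ):
    """Copy parsed values into the environment, keeping values that already exist."""
    for key, value in values.items():
        environ.setdefault(key, value)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_path = Path(tmp) / ".env"
        # Placeholder values only; the key names are assumptions for illustration.
        env_path.write_text(
            "# judge credentials\n"
            "OPENAI_API_KEY=sk-placeholder\n"
            "VLMEVALKIT_USE_MODELSCOPE=1\n"
        )
        print(sorted(load_env_file(env_path)))
```

Using setdefault means a variable exported in the shell takes precedence over the file, which is a common convention but an assumption about the desired behavior here.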

Step 2: Model Selection and Validation

Choose a VLM from the supported model registry defined in vlmeval/config.py. The registry maps model name strings to constructor functions via functools.partial(). Validate that the model loads correctly by running the vlmutil check command, which instantiates the model and runs a quick inference test.

Key considerations:

Use vlmutil mlist all to list all supported models
Models are grouped by required transformers version
The model name used here must exactly match the key in the supported_VLM dictionary
For very large models, ensure sufficient GPU VRAM or use model parallelism
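
The registry pattern mentioned above (name strings mapped to constructors via functools.partial()) can be sketched as follows. Everything in this example is a stand-in: DummyVLM, example_registry, build_model, and the model keys are hypothetical and are not taken from vlmeval/config.py.

```python
from functools import partial


class DummyVLM:
    """Hypothetical stand-in for a real model wrapper class."""

    def __init__(self, model_path, max_new_tokens=512):
        self.model_path = model_path
        self.max_new_tokens = max_new_tokens

    def generate(self, message):
        # A real wrapper would run inference; this one just echoes its input.
        return f"[{self.model_path}] echo: {message}"


# Hypothetical registry in the spirit of supported_VLM:
# each key maps to a constructor with its arguments pre-bound.
example_registry = {
    "dummy-7b": partial(DummyVLM, model_path="org/dummy-7b"),
    "dummy-7b-long": partial(DummyVLM, model_path="org/dummy-7b", max_new_tokens=2048),
}


def build_model(name, registry=example_registry):
    """Instantiate a model by exact registry key, failing loudly on a mismatch."""
    if name not in registry:
        known = ", ".join(sorted(registry))
        raise KeyError(f"Unknown model {name!r}; known keys: {known}")
    # Calling the partial constructs the model only now, not at import time.
    return registry[name]()


if __name__ == "__main__":
    model = build_model("dummy-7b")
    print(model.generate("hello"))
```

Storing partials rather than instances keeps the registry cheap to import, since no weights are loaded until a specific key is requested. The lookup is case-sensitive, which mirrors the requirement that the name match the dictionary key exactly.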

Step 3: Benchmark Selection

Select one or more image benchmarks to evaluate against. Benchmarks are organized by type (MCQ, VQA, Y/N, Caption) and registered in vlmeval/dataset/__init__.py. Each benchmark has a TSV data file that is automatically downloaded on first use.

Key considerations:

Use vlmutil dlist all to list all supported datasets
Benchmarks are categorized into levels (l1, l2, l3) by importance
Some benchmarks require specific judge models (e.g., MMVet uses GPT-4-turbo)
Test splits (e.g., MMMU_TEST, DocVQA_TEST) generate submission files instead of evaluation scores
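
To give a feel for what a benchmark TSV record might look like, here is a minimal sketch that reads a tab-separated MCQ file with Python's csv module. The column names (index, question, A to D, answer, category) and the sample rows are assumptions made for illustration; real benchmark files may carry additional columns such as encoded image data.

```python
import csv
import io

# Hypothetical two-row MCQ benchmark in TSV form (assumed column layout).
SAMPLE_TSV = (
    "index\tquestion\tA\tB\tC\tD\tanswer\tcategory\n"
    "0\tWhat color is the sky?\tBlue\tGreen\tRed\tBlack\tA\tperception\n"
    "1\tHow many legs does a cat have?\tTwo\tFour\tSix\tEight\tB\treasoning\n"
)


def read_benchmark_rows(tsv_text):
    """Parse TSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def options_of(row, letters="ABCD"):
    """Collect the non-empty option columns of one MCQ row."""
    return {letter: row[letter] for letter in letters if row.get(letter)}


if __name__ == "__main__":
    for row in read_benchmark_rows(SAMPLE_TSV):
        print(row["index"], row["question"], options_of(row), row["answer"])
```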

Step 4: Run Inference

Launch the evaluation pipeline via run.py. The inference engine loads the model, iterates over benchmark samples, builds prompts (using dataset-level or model-level prompt construction), generates predictions, and saves results to checkpoint files. For multi-GPU setups, use torchrun to run multiple model instances in data-parallel mode, where each rank processes a shard of the data.
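
The launch command can be assembled programmatically, as in the sketch below. It assumes that run.py accepts --data and --model followed by one or more names, as in the Quickstart examples; build_run_command, the model name my_model, and the dataset keys are illustrative placeholders, so check the CLI help of your installed version before relying on them.

```python
import shlex


def build_run_command(models, datasets, num_gpus=1, reuse=False):
    """Return an argument list for launching run.py (hypothetical helper).

    With more than one GPU, torchrun starts one process per GPU so that
    each rank can handle its own shard of the data.
    """
    if num_gpus > 1:
        cmd = ["torchrun", f"--nproc-per-node={num_gpus}", "run.py"]
    else:
        cmd = ["python", "run.py"]
    # Assumed flags: --data and --model each take one or more names.
    cmd += ["--data", *datasets, "--model", *models]
    if reuse:
        cmd.append("--reuse")
    return cmd


if __name__ == "__main__":
    command = build_run_command(["my_model"], ["MMBench_DEV_EN", "POPE"], num_gpus=4)
    # Print a shell-safe string; pass `command` to subprocess.run to execute it.
    print(shlex.join(command))
```

Building the command as a list rather than a string avoids quoting problems if a model or dataset name ever contains special characters.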

What happens:

The dataset is loaded and split across ranks for distributed processing
Each rank builds prompts via dataset.build_prompt() or model.build_prompt()
The model generates predictions via model.generate()
Intermediate results are saved as pickle checkpoint files
Results from all ranks are merged by rank 0
If SPLIT_THINK=True is set, thinking/reasoning content is separated from the answer
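
The shard, checkpoint, and merge steps listed above can be sketched in a few lines. This is a simplified model of the idea, not the toolkit's inference code: the round-robin sharding rule, the checkpoint file names, and the helper names are all assumptions, and the ranks run sequentially here instead of as separate processes.

```python
import pickle
import tempfile
from pathlib import Path


def shard_indices(num_samples, rank, world_size):
    """Round-robin shard: rank r takes samples r, r + world_size, and so on."""
    return list(range(rank, num_samples, world_size))


def run_rank(samples, rank, world_size, generate, out_dir):
    """Generate predictions for one rank's shard and checkpoint them with pickle."""
    indices = shard_indices(len(samples), rank, world_size)
    results = {i: generate(samples[i]) for i in indices}
    # Hypothetical checkpoint naming scheme, one file per rank.
    path = Path(out_dir) / f"rank{rank}_of_{world_size}.pkl"
    with open(path, "wb") as f:
        pickle.dump(results, f)
    return path


def merge_checkpoints(paths):
    """Merge per-rank checkpoints into one dict ordered by sample index."""
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            merged.update(pickle.load(f))
    return dict(sorted(merged.items()))


if __name__ == "__main__":
    samples = ["q0", "q1", "q2", "q3", "q4"]
    with tempfile.TemporaryDirectory() as tmp:
        # str.upper stands in for model.generate().
        paths = [run_rank(samples, r, 2, str.upper, tmp) for r in range(2)]
        print(merge_checkpoints(paths))
```

Keying results by sample index makes the merge order-independent, so rank 0 can combine the files regardless of which rank finished first.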

Step 5: Run Evaluation

After inference completes, rank 0 runs the evaluation pipeline. Each dataset class implements an evaluate() method that loads predictions, applies post-processing (answer extraction, option matching), optionally invokes a judge LLM for open-ended tasks, computes metrics (accuracy, F1, BLEU, etc.), and writes score files.

What happens:

Predictions are loaded from the result file
For MCQ tasks, answers are extracted using pattern matching or LLM-based extraction
For open-ended tasks, a judge model (e.g., GPT-4) scores predictions against ground truth
Metrics are computed per category and overall
Score CSV/JSON files are written to the working directory
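
For the MCQ path, pattern-based answer extraction followed by per-category accuracy might look roughly like the sketch below. The regular expressions are a small illustrative heuristic, not the rules used by the toolkit, and the record fields (prediction, answer, category) are assumed names.

```python
import re
from collections import defaultdict


def extract_option(prediction, letters="ABCD"):
    """Return the option letter found by a few simple patterns, or None."""
    text = prediction.strip()
    patterns = [
        rf"^\(?([{letters}])\)?[.:]?$",                  # bare "B", "(B)", "B."
        rf"^\(?([{letters}])\)?[.:)]\s",                 # "B. Four", "(B) Four"
        rf"answer\s*(?:is|:)\s*\(?([{letters}])\)?\b",   # "the answer is B"
    ]
    for pattern in patterns:
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.group(1).upper()
    # No pattern matched; a real pipeline might fall back to a judge LLM here.
    return None


def score(records):
    """Compute accuracy per category plus an 'Overall' entry."""
    hits, totals = defaultdict(int), defaultdict(int)
    for rec in records:
        correct = extract_option(rec["prediction"]) == rec["answer"]
        for key in (rec["category"], "Overall"):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


if __name__ == "__main__":
    demo = [
        {"prediction": "B. Four", "answer": "B", "category": "reasoning"},
        {"prediction": "The answer is (A).", "answer": "A", "category": "perception"},
        {"prediction": "I think it is green", "answer": "B", "category": "perception"},
    ]
    print(score(demo))
```

Predictions that match no pattern count as wrong in this sketch, which is why a judge model is useful when outputs are free-form.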

Step 6: Review Results

Examine the evaluation output files in the working directory. Results are organized as {model_name}_{dataset_name}_*.csv files containing per-category and overall metrics. The scripts/summarize.py script can aggregate scores across multiple benchmarks into a single summary table. The vlmutil scan command can detect API failures in results.

Key considerations:

Check {model}_{dataset}.xlsx for per-sample predictions and extracted answers
Check {model}_{dataset}_{judge}.xlsx for judge evaluation details
Use the --reuse flag to skip already-completed inference on re-runs
Performance may vary across environments due to library version differences
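
As a lightweight alternative to reading each score file by hand, the sketch below scans a working directory for {model_name}_{dataset_name}_*.csv files and collects one headline number per dataset. It is not scripts/summarize.py: the helper name, the assumption that each CSV has an "Overall" column in its first data row, and the example file suffixes are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def collect_overall_scores(work_dir, model_name, column="Overall"):
    """Map dataset name -> value of `column` in the first row of each score CSV."""
    summary = {}
    prefix = f"{model_name}_"
    for path in sorted(Path(work_dir).glob(f"{prefix}*.csv")):
        # Strip the model prefix and the trailing suffix (the "*" part of the pattern).
        dataset = path.stem[len(prefix):].rsplit("_", 1)[0]
        with open(path, newline="") as f:
            rows = list(csv.DictReader(f))
        # Skip files that do not follow the assumed layout.
        if rows and column in rows[0]:
            summary[dataset] = float(rows[0][column])
    return summary


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Fabricated demo files with made-up numbers, for illustration only.
        (Path(tmp) / "my_model_MMBench_DEV_EN_acc.csv").write_text(
            "split,Overall,perception\nnone,0.75,0.8\n"
        )
        (Path(tmp) / "my_model_POPE_score.csv").write_text("Overall\n0.9\n")
        print(collect_overall_scores(tmp, "my_model"))
```

Because dataset names can themselves contain underscores, the model name is passed in explicitly rather than guessed from the file name.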
