Workflow:Open compass VLMEvalKit Adding Custom Benchmark

Knowledge Sources	VLMEvalKit Development Guide
Domains	VLM_Evaluation, Benchmark_Development, Development
Last Updated	2026-02-14 00:00 GMT

Overview

Process for adding a new evaluation benchmark to VLMEvalKit, enabling all supported VLMs to be evaluated on a custom dataset.

Description

This workflow guides developers through implementing a new benchmark in VLMEvalKit. The process involves preparing a TSV data file with standardized fields, implementing a dataset class with build_prompt() and evaluate() methods, registering the dataset, and validating the evaluation pipeline. Once integrated, any VLM supported by VLMEvalKit can be evaluated on the new benchmark. Developers can reuse existing dataset base classes (ImageMCQDataset for multiple-choice, ImageBaseDataset for VQA, VideoBaseDataset for video, etc.) to minimize implementation effort.

Usage

Execute this workflow when you want to evaluate VLMs on a dataset or benchmark not yet supported by VLMEvalKit. You should have the benchmark data (images/videos and questions with ground-truth answers), a clear definition of evaluation metrics, and familiarity with Python class inheritance and pandas DataFrames.

Execution Steps

Step 1: Prepare Benchmark TSV File

Organize the benchmark data into a single TSV file with standardized fields. Each row represents one evaluation sample. Images are stored as base64-encoded strings within the TSV. The file should be uploaded to a downloadable location (e.g., HuggingFace) so VLMEvalKit can auto-download it.

Required fields:

index - Unique integer identifier for each sample
image - Base64-encoded image data (use encode_image_to_base64 from vlmeval/smp/vlm.py)
question - The text question for the VLM
answer - Ground-truth answer (may be omitted for test splits whose answers are withheld)

Optional fields:

hint - Additional context or instructions
A, B, C, D - Multiple-choice options
category - Category label for per-category metrics
l2-category - Sub-category label
image_path - Alternative to base64 for multi-image datasets
split - Train/dev/test split identifier
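
The field layout above can be sketched with the standard library alone. This is a minimal illustration, not the toolkit's own tooling: build_tsv, encode_bytes_to_base64, and the sample data are hypothetical names introduced here, and in practice the encode_image_to_base64 helper mentioned above would likely replace the hand-rolled encoder.

```python
import base64
import csv
import io


def encode_bytes_to_base64(raw: bytes) -> str:
    """Return the base64 text form of raw image bytes."""
    return base64.b64encode(raw).decode("utf-8")


def build_tsv(samples) -> str:
    """Serialize samples to TSV text, one row per evaluation sample."""
    fields = ["index", "image", "question", "answer", "category"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for i, sample in enumerate(samples):
        writer.writerow({
            "index": i,  # unique integer identifier
            "image": encode_bytes_to_base64(sample["image_bytes"]),
            "question": sample["question"],
            "answer": sample["answer"],
            "category": sample.get("category", ""),
        })
    return buf.getvalue()


# Hypothetical sample; real bytes would come from open(path, "rb").read().
samples = [{
    "image_bytes": b"\x89PNG fake bytes",
    "question": "What is shown?",
    "answer": "A cat",
    "category": "animals",
}]
tsv_text = build_tsv(samples)
```

Writing tsv_text to a file and uploading it is left out; the hosting location is whatever the benchmark author chooses.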

Step 2: Choose Base Dataset Class

Select the appropriate base class to inherit from based on the benchmark type. Each base class provides default implementations for data loading, prompt construction, and basic evaluation.

Available base classes:

ImageMCQDataset - For multiple-choice question benchmarks
ImageBaseDataset - For general image-based VQA benchmarks
VideoBaseDataset - For video understanding benchmarks
TextBaseDataset - For text-only benchmarks (no images)
ImageYORNDataset - For yes-or-no question benchmarks

Step 3: Implement Dataset Class

Create a new Python file in vlmeval/dataset/ with a class inheriting from the chosen base class. Set class attributes including DATASET_URL (download URL), DATASET_MD5 (checksum), and TYPE (benchmark category). The TYPE determines default evaluation behavior (e.g., "MCQ", "VQA", "Y/N").

Key considerations:

Set MODALITY to "IMAGE" or "VIDEO" as appropriate
DATASET_URL is a dictionary mapping dataset name to download URL
DATASET_MD5 provides integrity checking for downloaded files
The dataset is auto-downloaded to $LMUData on first use
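
A rough sketch of the class attributes follows, together with one way to compute the checksum for DATASET_MD5. The names MyBench, _StandInBase, file_md5, and the example.com URL are assumptions made for illustration; in the real repository the class would inherit from the base class chosen in Step 2 rather than from the placeholder used here.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class _StandInBase:
    """Placeholder so the sketch runs without VLMEvalKit installed.

    In the repository, inherit from the chosen base class instead.
    """


class MyBench(_StandInBase):
    TYPE = "VQA"        # benchmark category
    MODALITY = "IMAGE"  # or "VIDEO"
    # Hypothetical URL: dataset name -> download location
    DATASET_URL = {"MyBench": "https://example.com/MyBench.tsv"}
    DATASET_MD5 = {}    # dataset name -> checksum, filled in below


# Demo: checksum a small stand-in TSV and record it under the same key.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = os.path.join(tmp, "MyBench.tsv")
    with open(tsv_path, "wb") as f:
        f.write(b"index\tquestion\tanswer\n0\tWhat is shown?\tA cat\n")
    MyBench.DATASET_MD5 = {"MyBench": file_md5(tsv_path)}
```

Keeping DATASET_URL and DATASET_MD5 keyed by the same dataset name makes it easy to check that every download has a matching checksum.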

Step 4: Implement build_prompt Method

Implement build_prompt(self, line) to construct the multi-modal message that will be sent to VLMs. The method receives a data sample (as a pandas Series or index integer) and returns a list of message dictionaries in the format [dict(type='image', value=IMAGE_PATH), dict(type='text', value=PROMPT)].

Key considerations:

Combine hint, question, and options (if MCQ) into the text prompt
Include appropriate task instructions (e.g., "Select the correct answer")
For multi-image samples, include multiple image dictionaries at appropriate positions
The default ImageMCQDataset.build_prompt handles common MCQ formatting
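
The message format described above might be assembled along these lines. This is a standalone sketch, not the toolkit's build_prompt: build_prompt_sketch is a hypothetical free function, the sample is a plain dict rather than a pandas Series, the image paths are assumed to be already written to disk, and the exact prompt wording is an illustrative choice.

```python
import string


def build_prompt_sketch(line: dict, image_paths: list) -> list:
    """Build [image..., text] messages from one sample.

    Assumes missing optional fields are absent or empty strings
    (a pandas row would need an explicit NaN check instead).
    """
    parts = []
    hint = line.get("hint")
    if hint:
        parts.append(f"Hint: {hint}")
    parts.append(f"Question: {line['question']}")

    # Collect whichever option letters (A, B, C, ...) the sample provides.
    options = {k: line[k] for k in string.ascii_uppercase if line.get(k)}
    if options:
        parts.append("Options:")
        parts.extend(f"{k}. {v}" for k, v in options.items())
        parts.append("Select the correct answer and reply with the option letter only.")

    # One image dictionary per path, followed by a single text dictionary.
    msgs = [dict(type="image", value=p) for p in image_paths]
    msgs.append(dict(type="text", value="\n".join(parts)))
    return msgs


# Hypothetical sample and image path
sample = {"question": "What animal is this?", "hint": "Look at the ears.", "A": "Cat", "B": "Dog"}
messages = build_prompt_sketch(sample, ["/tmp/mybench/0.jpg"])
```

For interleaved multi-image prompts, the image dictionaries would be placed between text segments instead of all at the front.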

Step 5: Implement evaluate Method

Implement evaluate(self, eval_file, **judge_kwargs) to compute benchmark metrics. The method receives the path to the prediction file and judge configuration, processes predictions, and returns metrics as a dictionary or DataFrame.

What happens:

Load predictions from the eval file using load()
Apply answer extraction or post-processing to raw model outputs
For MCQ: extract option letters from predictions using matching utilities
For open-ended: optionally use a judge LLM (available via judge_kwargs) to score responses
Compute metrics (accuracy, F1, BLEU, CIDEr, etc.) overall and per-category
Write score files to the working directory
Return results as a dictionary of lists or a pandas DataFrame
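
The scoring part of these steps can be illustrated for the MCQ case as follows. This is a simplified sketch under stated assumptions: extract_option is a naive regex stand-in for the toolkit's matching utilities, score_mcq is a hypothetical helper, records are plain dicts rather than rows loaded with load(), and writing score files and the judge LLM path are omitted.

```python
import re
from collections import defaultdict


def extract_option(prediction: str, choices: str = "ABCD"):
    """Naive extraction: first standalone uppercase option letter, or None.

    It would mistake a capital article ("A cat") for option A; the
    toolkit's matching utilities should be preferred in real use.
    """
    match = re.search(rf"\b([{choices}])\b", prediction.strip())
    return match.group(1) if match else None


def score_mcq(records):
    """Return accuracy (in percent) overall and per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        correct = extract_option(rec["prediction"]) == rec["answer"]
        for key in ("Overall", rec.get("category", "uncategorized")):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: 100.0 * hits[key] / totals[key] for key in totals}


# Hypothetical predictions standing in for the loaded eval file
records = [
    {"prediction": "The answer is B.", "answer": "B", "category": "animals"},
    {"prediction": "C", "answer": "A", "category": "animals"},
    {"prediction": "(D) a red car", "answer": "D", "category": "vehicles"},
]
scores = score_mcq(records)
```

With these three records, two predictions are correct overall, one of two in "animals" and the single "vehicles" sample.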

Step 6: Register and Validate

Register the dataset in vlmeval/dataset/__init__.py by adding it to the SUPPORTED_DATASETS dictionary and importing the class. For video datasets, also add pre-configured settings in vlmeval/dataset/video_dataset_config.py. Then validate by running a small evaluation.

Key considerations:

The dataset name in SUPPORTED_DATASETS is used with --data in run.py
Add the appropriate judge model selection in run.py if the benchmark needs a specific judge
Test with at least one local model and one API model
Run pre-commit checks before submitting
