Workflow:Open compass VLMEvalKit Adding Custom Benchmark
| Knowledge Sources | |
|---|---|
| Domains | VLM_Evaluation, Benchmark_Development, Development |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Process for adding a new evaluation benchmark to VLMEvalKit, enabling all supported VLMs to be evaluated on a custom dataset.
Description
This workflow guides developers through implementing a new benchmark in VLMEvalKit. The process involves preparing a TSV data file with standardized fields, implementing a dataset class with build_prompt() and evaluate() methods, registering the dataset, and validating the evaluation pipeline. Once integrated, any VLM supported by VLMEvalKit can be evaluated on the new benchmark. Developers can reuse existing dataset base classes (ImageMCQDataset for multiple-choice, ImageBaseDataset for VQA, VideoBaseDataset for video, etc.) to minimize implementation effort.
Usage
Execute this workflow when you want to evaluate VLMs on a dataset or benchmark not yet supported by VLMEvalKit. You should have the benchmark data (images/videos and questions with ground-truth answers), a clear definition of evaluation metrics, and familiarity with Python class inheritance and pandas DataFrames.
Execution Steps
Step 1: Prepare Benchmark TSV File
Organize the benchmark data into a single TSV file with standardized fields. Each row represents one evaluation sample. Images are stored as base64-encoded strings within the TSV. The file should be uploaded to a downloadable location (e.g., HuggingFace) so VLMEvalKit can auto-download it.
Required fields:
- index - Unique integer identifier for each sample
- image - Base64-encoded image data (use encode_image_to_base64 from vlmeval/smp/vlm.py)
- question - The text question for the VLM
- answer - Ground-truth answer (may be omitted for test splits whose labels are withheld)
Optional fields:
- hint - Additional context or instructions
- A, B, C, D - Multiple-choice options
- category - Category label for per-category metrics
- l2-category - Sub-category label
- image_path - Alternative to base64 for multi-image datasets
- split - Train/dev/test split identifier
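The field layout above can be sketched with the standard library alone. This is a minimal illustration, not VLMEvalKit code: encode_file_to_base64 and write_benchmark_tsv are hypothetical helpers, and the former base64-encodes the raw file bytes as a stand-in for encode_image_to_base64, which may process the image differently.

```python
import base64
import csv


def encode_file_to_base64(path):
    """Return the raw bytes of the file at `path` as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def write_benchmark_tsv(samples, tsv_path):
    """Write samples (a list of dicts) to a tab-separated file.

    Each sample needs 'image_file', 'question', and 'answer';
    'category' is optional and defaults to an empty string.
    """
    fields = ["index", "image", "question", "answer", "category"]
    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
        writer.writeheader()
        for i, sample in enumerate(samples):
            writer.writerow({
                "index": i,  # unique integer identifier
                "image": encode_file_to_base64(sample["image_file"]),
                "question": sample["question"],
                "answer": sample["answer"],
                "category": sample.get("category", ""),
            })
```

Calling write_benchmark_tsv(samples, "MyBench.tsv") with a list of sample dicts would produce one row per sample; the file name MyBench.tsv is a placeholder.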
Step 2: Choose Base Dataset Class
Select the appropriate base class to inherit from based on the benchmark type. Each base class provides default implementations for data loading, prompt construction, and basic evaluation.
Available base classes:
- ImageMCQDataset - For multiple-choice question benchmarks
- ImageBaseDataset - For general image-based VQA benchmarks
- VideoBaseDataset - For video understanding benchmarks
- TextBaseDataset - For text-only benchmarks (no images)
- ImageYORNDataset - For yes-or-no question benchmarks
Step 3: Implement Dataset Class
Create a new Python file in vlmeval/dataset/ with a class inheriting from the chosen base class. Set class attributes including DATASET_URL (download URL), DATASET_MD5 (checksum), and TYPE (benchmark category). The TYPE determines default evaluation behavior (e.g., "MCQ", "VQA", "Y/N").
Key considerations:
- Set MODALITY to "IMAGE" or "VIDEO" as appropriate
- DATASET_URL is a dictionary mapping dataset name to download URL
- DATASET_MD5 provides integrity checking for downloaded files
- The dataset is auto-downloaded to $LMUData on first use
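As a rough sketch of how these attributes fit together, the snippet below declares a hypothetical MyBench class. The ImageMCQDataset defined here is only a stand-in stub so the example runs on its own; in VLMEvalKit you would inherit from the real base class instead. The dataset name, URL, and checksum are placeholders, and file_md5 is an illustrative helper for producing the checksum of your own TSV.

```python
import hashlib


class ImageMCQDataset:
    """Stand-in stub, NOT the real VLMEvalKit class; it only makes this sketch runnable."""
    TYPE = "MCQ"
    MODALITY = "IMAGE"


class MyBench(ImageMCQDataset):
    # Hypothetical benchmark: name, URL, and checksum below are placeholders.
    TYPE = "MCQ"
    MODALITY = "IMAGE"
    DATASET_URL = {"MyBench": "https://example.com/path/to/MyBench.tsv"}
    DATASET_MD5 = {"MyBench": "0123456789abcdef0123456789abcdef"}


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Keeping DATASET_URL and DATASET_MD5 keyed by the same dataset name makes it easy to check that every downloadable file has a matching checksum.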
Step 4: Implement build_prompt Method
Implement build_prompt(self, line) to construct the multi-modal message that will be sent to VLMs. The method receives a data sample (either a pandas Series or an integer row index) and returns a list of message dictionaries in the format [dict(type='image', value=IMAGE_PATH), dict(type='text', value=PROMPT)].
Key considerations:
- Combine hint, question, and options (if MCQ) into the text prompt
- Include appropriate task instructions (e.g., "Select the correct answer")
- For multi-image samples, include multiple image dictionaries at appropriate positions
- The default ImageMCQDataset.build_prompt handles common MCQ formatting
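The message format described above can be illustrated with a standalone function. This is a sketch under assumptions rather than the library's implementation: build_mcq_messages is a hypothetical name, the prompt wording is illustrative, and image paths are passed in directly instead of being resolved by the dataset class.

```python
import string


def build_mcq_messages(line, image_paths):
    """Build a message list for one multiple-choice sample.

    `line` is a dict-like row (a dict or pandas Series);
    `image_paths` is a list of local image file paths.
    """
    def present(key):
        value = line.get(key)
        # Treat None, NaN (not equal to itself), and blank strings as missing.
        return value is not None and value == value and str(value).strip() != ""

    # Collect whichever option columns (A, B, C, ...) the row actually provides.
    options = {c: line[c] for c in string.ascii_uppercase if present(c)}

    parts = []
    if present("hint"):
        parts.append(f"Hint: {line['hint']}")
    parts.append(f"Question: {line['question']}")
    if options:
        parts.append("Options:")
        parts.extend(f"{letter}. {text}" for letter, text in options.items())
        parts.append("Select the correct answer and reply with the option letter only.")

    # Images come first, followed by a single text message.
    messages = [dict(type="image", value=p) for p in image_paths]
    messages.append(dict(type="text", value="\n".join(parts)))
    return messages
```

Placing all images before the text is the simplest layout; a benchmark whose questions refer to images in a specific order may need to interleave image and text entries instead.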
Step 5: Implement evaluate Method
Implement evaluate(self, eval_file, **judge_kwargs) to compute benchmark metrics. The method receives the path to the prediction file and judge configuration, processes predictions, and returns metrics as a dictionary or DataFrame.
What happens:
- Load predictions from the eval file using load()
- Apply answer extraction or post-processing to raw model outputs
- For MCQ: extract option letters from predictions using matching utilities
- For open-ended: optionally use a judge LLM (available via judge_kwargs) to score responses
- Compute metrics (accuracy, F1, BLEU, CIDEr, etc.) overall and per-category
- Write score files to the working directory
- Return results as a dictionary of lists or a pandas DataFrame
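The MCQ scoring steps can be sketched as follows. The helpers extract_option_letter and score_mcq are hypothetical and deliberately simplistic: they stand in for VLMEvalKit's own matching utilities, work on plain dictionaries rather than a loaded prediction file, and do not call a judge LLM.

```python
import re
from collections import defaultdict


def extract_option_letter(prediction, choices="ABCD"):
    """Return the single option letter named in `prediction`, or None if absent or ambiguous."""
    found = set(re.findall(rf"\b([{choices}])\b", str(prediction)))
    return found.pop() if len(found) == 1 else None


def score_mcq(records):
    """Compute overall and per-category accuracy.

    `records` is an iterable of dicts with 'prediction', 'answer',
    and optionally 'category'.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        correct = extract_option_letter(rec["prediction"]) == rec["answer"]
        # Count every sample once overall and once under its category.
        for key in ("Overall", rec.get("category", "uncategorized")):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}
```

A standalone-letter regex is easy to fool (for example, "A cat" would be read as option A), which is one reason a real evaluate() typically relies on more careful matching or a judge model for free-form outputs.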
Step 6: Register and Validate
Register the dataset in vlmeval/dataset/__init__.py by importing the class and adding it to the dataset class lists that populate SUPPORTED_DATASETS. For video datasets, also add pre-configured settings in vlmeval/dataset/video_dataset_config.py. Then validate by running a small evaluation.
Key considerations:
- The dataset name in SUPPORTED_DATASETS is used with --data in run.py
- Add the appropriate judge model selection in run.py if the benchmark needs a specific judge
- Test with at least one local model and one API model
- Run pre-commit checks before submitting