Workflow:Open compass VLMEvalKit Adding Custom Benchmark

From Leeroopedia
Knowledge Sources
Domains VLM_Evaluation, Benchmark_Development, Development
Last Updated 2026-02-14 00:00 GMT

Overview

Process for adding a new evaluation benchmark to VLMEvalKit, enabling all supported VLMs to be evaluated on a custom dataset.

Description

This workflow guides developers through implementing a new benchmark in VLMEvalKit. The process involves preparing a TSV data file with standardized fields, implementing a dataset class with build_prompt() and evaluate() methods, registering the dataset, and validating the evaluation pipeline. Once integrated, any VLM supported by VLMEvalKit can be evaluated on the new benchmark. Developers can reuse existing dataset base classes (ImageMCQDataset for multiple-choice, ImageBaseDataset for VQA, VideoBaseDataset for video, etc.) to minimize implementation effort.

Usage

Execute this workflow when you want to evaluate VLMs on a dataset or benchmark not yet supported by VLMEvalKit. You should have the benchmark data (images/videos and questions with ground-truth answers), a clear definition of evaluation metrics, and familiarity with Python class inheritance and pandas DataFrames.

Execution Steps

Step 1: Prepare Benchmark TSV File

Organize the benchmark data into a single TSV file with standardized fields. Each row represents one evaluation sample. Images are stored as base64-encoded strings within the TSV. The file should be uploaded to a downloadable location (e.g., HuggingFace) so VLMEvalKit can auto-download it.

Required fields:

  • index - Unique integer identifier for each sample
  • image - Base64-encoded image data (use encode_image_to_base64 from vlmeval/smp/vlm.py)
  • question - The text question for the VLM
  • answer - Ground-truth answer (may be omitted for test splits whose labels are withheld)

Optional fields:

  • hint - Additional context or instructions
  • A, B, C, D - Multiple-choice options
  • category - Category label for per-category metrics
  • l2-category - Sub-category label
  • image_path - Alternative to base64 for multi-image datasets
  • split - Train/dev/test split identifier
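The field layout above can be sketched as a small TSV writer. This is a minimal illustration using only the standard library, not VLMEvalKit's own tooling: the helper names (encode_bytes_to_base64, build_tsv), the sample records, and the placeholder image bytes are all assumptions made for the example. In practice the image column would hold the base64 form of real image files.

```python
import base64
import csv
import io

REQUIRED_FIELDS = ["index", "image", "question", "answer"]
OPTIONAL_FIELDS = ["hint", "A", "B", "C", "D", "category"]


def encode_bytes_to_base64(raw: bytes) -> str:
    """Return the base64 text form of raw image bytes (hypothetical helper)."""
    return base64.b64encode(raw).decode("utf-8")


def build_tsv(samples):
    """Serialize sample dicts into TSV text with the standardized columns."""
    # Only emit optional columns that at least one sample actually uses.
    used_optional = [f for f in OPTIONAL_FIELDS if any(f in s for s in samples)]
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=REQUIRED_FIELDS + used_optional,
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for idx, sample in enumerate(samples):
        row = {
            "index": idx,  # unique integer identifier
            "image": encode_bytes_to_base64(sample["image_bytes"]),
            "question": sample["question"],
            "answer": sample["answer"],
        }
        for field in used_optional:
            row[field] = sample.get(field, "")
        writer.writerow(row)
    return buffer.getvalue()


# Placeholder records; real samples would carry the bytes of actual images.
samples = [
    {"image_bytes": b"placeholder-image-bytes-1", "question": "What color is the sky?",
     "A": "Blue", "B": "Green", "answer": "A", "category": "color"},
    {"image_bytes": b"placeholder-image-bytes-2", "question": "How many cats are visible?",
     "A": "One", "B": "Two", "answer": "B", "category": "count"},
]
tsv_text = build_tsv(samples)
```

Writing tsv_text to a file gives the artifact that would then be uploaded to a downloadable location.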

Step 2: Choose Base Dataset Class

Select the appropriate base class to inherit from based on the benchmark type. Each base class provides default implementations for data loading, prompt construction, and basic evaluation.

Available base classes:

  • ImageMCQDataset - For multiple-choice question benchmarks
  • ImageBaseDataset - For general image-based VQA benchmarks
  • VideoBaseDataset - For video understanding benchmarks
  • TextBaseDataset - For text-only benchmarks (no images)
  • ImageYORNDataset - For yes-or-no question benchmarks

Step 3: Implement Dataset Class

Create a new Python file in vlmeval/dataset/ with a class inheriting from the chosen base class. Set class attributes including DATASET_URL (download URL), DATASET_MD5 (checksum), and TYPE (benchmark category). The TYPE determines default evaluation behavior (e.g., "MCQ", "VQA", "Y/N").

Key considerations:

  • Set MODALITY to "IMAGE" or "VIDEO" as appropriate
  • DATASET_URL is a dictionary mapping dataset name to download URL
  • DATASET_MD5 provides integrity checking for downloaded files
  • The dataset is auto-downloaded to $LMUData on first use
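As a rough sketch of these attributes, the class below declares them on a stand-in base class so the example runs without VLMEvalKit installed; in a real integration the class would inherit from the base class chosen in Step 2. The dataset name "MyBench", the example.com URL, the stub class, and the file_md5 helper are all hypothetical. The checksum is computed with hashlib over a throwaway file to show where a DATASET_MD5 value could come from.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ImageBaseDatasetStub:
    """Stand-in for the real base class (assumption for this sketch only)."""
    MODALITY = "IMAGE"
    TYPE = None
    DATASET_URL = {}
    DATASET_MD5 = {}


# Write a tiny throwaway TSV so there is a file to checksum.
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(b"index\timage\tquestion\tanswer\n")
    tsv_path = tmp.name
checksum = file_md5(tsv_path)
os.remove(tsv_path)


class MyBenchDataset(ImageBaseDatasetStub):
    """Hypothetical benchmark declaration."""
    TYPE = "MCQ"
    MODALITY = "IMAGE"
    # Both dictionaries are keyed by the dataset name.
    DATASET_URL = {"MyBench": "https://example.com/MyBench.tsv"}  # placeholder URL
    DATASET_MD5 = {"MyBench": checksum}
```

Keeping DATASET_URL and DATASET_MD5 keyed by the same dataset name makes it easy to spot a missing checksum when a class serves several dataset variants.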

Step 4: Implement build_prompt Method

Implement build_prompt(self, line) to construct the multi-modal message that will be sent to VLMs. The method receives a data sample, either as a pandas Series or as an integer row index that is first resolved to the corresponding row, and returns a list of message dictionaries in the format [dict(type='image', value=IMAGE_PATH), dict(type='text', value=PROMPT)].

Key considerations:

  • Combine hint, question, and options (if MCQ) into the text prompt
  • Include appropriate task instructions (e.g., "Select the correct answer")
  • For multi-image samples, include multiple image dictionaries at appropriate positions
  • The default ImageMCQDataset.build_prompt handles common MCQ formatting; ImageBaseDataset.build_prompt only pairs the image with the question text
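The considerations above can be illustrated with a standalone function that assembles the message list. This is a simplified sketch, not the library's implementation: build_mcq_prompt is a hypothetical name, the sample is a plain dict (a pandas Series supports the same key lookups, though missing values would need an explicit NaN check), and the image path is assumed to have been written to disk already.

```python
import string


def build_mcq_prompt(line, image_path):
    """Assemble an image + text message list for one multiple-choice sample."""
    # Collect options stored under single uppercase letters (A, B, C, ...).
    options = {
        letter: line[letter]
        for letter in string.ascii_uppercase
        if letter in line and str(line[letter]).strip()
    }

    parts = []
    hint = line.get("hint")
    if hint:
        parts.append(f"Hint: {hint}")
    parts.append(f"Question: {line['question']}")
    if options:
        parts.append("Options:")
        parts.extend(f"{key}. {value}" for key, value in options.items())
        parts.append("Select the correct answer from the options above.")
    prompt = "\n".join(parts)

    # Message format described in this step: image entries first, then text.
    return [
        dict(type="image", value=image_path),
        dict(type="text", value=prompt),
    ]


sample = {
    "index": 0,
    "question": "What color is the sky?",
    "hint": "Look at the top of the image.",
    "A": "Blue",
    "B": "Green",
    "answer": "A",
}
messages = build_mcq_prompt(sample, "/tmp/mybench/0.jpg")  # placeholder path
```

For a multi-image sample, the same list would simply carry several image dictionaries, placed where the images should appear relative to the text.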

Step 5: Implement evaluate Method

Implement evaluate(self, eval_file, **judge_kwargs) to compute benchmark metrics. The method receives the path to the prediction file and judge configuration, processes predictions, and returns metrics as a dictionary or DataFrame.

What happens:

  • Load predictions from the eval file using load()
  • Apply answer extraction or post-processing to raw model outputs
  • For MCQ: extract option letters from predictions using matching utilities
  • For open-ended: optionally use a judge LLM (available via judge_kwargs) to score responses
  • Compute metrics (accuracy, F1, BLEU, CIDEr, etc.) overall and per-category
  • Write score files to the working directory
  • Return results as a dictionary of lists or a pandas DataFrame
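The scoring portion of these steps can be sketched for the MCQ case as follows. The example deliberately avoids VLMEvalKit's own matching utilities and uses a naive, case-sensitive regular expression instead, so extract_option and score_predictions are hypothetical helpers and the prediction records are made up. A real evaluate method would load the predictions from eval_file and would likely need more robust answer extraction.

```python
import re
from collections import defaultdict


def extract_option(prediction, valid="ABCD"):
    """Naively pick the first standalone option letter from a model response."""
    match = re.search(rf"\b([{valid}])\b", prediction.strip())
    return match.group(1) if match else None


def score_predictions(records):
    """Compute overall and per-category accuracy as a dictionary of lists."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        correct = extract_option(record["prediction"]) == record["answer"]
        # Every sample counts toward "Overall" and toward its own category.
        for key in ("Overall", record.get("category", "Uncategorized")):
            totals[key] += 1
            hits[key] += int(correct)
    names = ["Overall"] + sorted(k for k in totals if k != "Overall")
    return {
        "category": names,
        "accuracy": [hits[name] / totals[name] for name in names],
    }


# Made-up predictions standing in for the contents of an eval file.
records = [
    {"prediction": "The answer is B.", "answer": "B", "category": "color"},
    {"prediction": "(A)", "answer": "C", "category": "color"},
    {"prediction": "C", "answer": "C", "category": "count"},
    {"prediction": "Answer: D", "answer": "D", "category": "count"},
]
scores = score_predictions(records)
```

The returned dictionary of lists converts directly into a table with one row per category, which is a convenient shape for writing a score file.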

Step 6: Register and Validate

Register the dataset in vlmeval/dataset/__init__.py by adding it to the SUPPORTED_DATASETS dictionary and importing the class. For video datasets, also add pre-configured settings in vlmeval/dataset/video_dataset_config.py. Then validate by running a small evaluation.

Key considerations:

  • The dataset name in SUPPORTED_DATASETS is used with --data in run.py
  • Add the appropriate judge model selection in run.py if the benchmark needs a specific judge
  • Test with at least one local model and one API model
  • Run pre-commit checks before submitting
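Before running the small evaluation, a quick sanity check of the TSV can catch structural problems early. The checker below is a hypothetical helper, not part of VLMEvalKit; it tests only the required fields listed in Step 1 and assumes the file content has been read into a string. The field size limit is raised because base64-encoded images typically exceed the csv module's default limit.

```python
import csv
import io

# Base64 image fields are usually far larger than csv's default field limit.
csv.field_size_limit(2**31 - 1)

REQUIRED_FIELDS = ("index", "image", "question")


def check_tsv(tsv_text, require_answer=True):
    """Return a list of human-readable problems found in benchmark TSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    header = reader.fieldnames or []
    required = list(REQUIRED_FIELDS) + (["answer"] if require_answer else [])
    missing = [field for field in required if field not in header]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
        return problems

    seen = set()
    for row_number, row in enumerate(reader, start=1):
        idx = row["index"] or ""
        if not idx.isdigit():
            problems.append(f"row {row_number}: index is not an integer")
        elif idx in seen:
            problems.append(f"row {row_number}: duplicate index {idx}")
        seen.add(idx)
        if not (row["question"] or "").strip():
            problems.append(f"row {row_number}: empty question")
    return problems


good_tsv = (
    "index\timage\tquestion\tanswer\n"
    "0\tAAAA\tWhat is shown?\tA\n"
    "1\tBBBB\tHow many?\tB\n"
)
bad_tsv = (
    "index\timage\tquestion\tanswer\n"
    "0\tAAAA\tWhat is shown?\tA\n"
    "0\tBBBB\t\tB\n"
)
good_report = check_tsv(good_tsv)
bad_report = check_tsv(bad_tsv)
```

An empty report means the file passes these basic checks; it does not replace an end-to-end run with a real model.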
