Principle: OpenCompass VLMEvalKit MCQ Evaluation
| Field | Value |
|---|---|
| Source | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| Domain | Vision, Evaluation, NLP |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
An evaluation methodology that extracts and scores multiple-choice answers from VLM predictions using a multi-stage heuristic-then-LLM pipeline.
Description
MCQ evaluation in VLMEvalKit follows a two-stage approach. First, heuristic rules attempt to extract the answer choice (A/B/C/D) from the model's prediction text using pattern matching (regex for option letters, prefix matching against option contents, etc.). When the heuristics fail, an LLM judge (e.g., GPT-3.5/GPT-4) is used as a fallback to interpret the model's free-form response. Results are reported as accuracy per split and per category. The system also supports a verifier mode and circular evaluation (the option order is rotated across repeated queries, and a question counts as correct only if every rotation is answered correctly) for robustness.
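The heuristic stage can be sketched as follows. This is a simplified illustration of the idea, not VLMEvalKit's exact rule set; the function name and the specific patterns are assumptions:

```python
import re

def try_heuristic_extract(prediction, choices=("A", "B", "C", "D")):
    """Try to pull an option letter out of free-form model output.

    Simplified sketch of the heuristic stage: exact match first,
    then an explicit "answer" phrase, then a leading option letter.
    Returns the letter, or None when no rule fires.
    """
    text = prediction.strip()
    # Rule 1: the prediction is exactly an option letter, e.g. "B", "(B)", "B."
    m = re.fullmatch(r"\(?([A-D])\)?\.?", text)
    if m and m.group(1) in choices:
        return m.group(1)
    # Rule 2: an explicit "answer is X" / "Answer: X" phrase.
    m = re.search(r"[Aa]nswer(?:\s+is)?\s*[:\-]?\s*\(?([A-D])\)?", text)
    if m and m.group(1) in choices:
        return m.group(1)
    # Rule 3: a leading option letter, e.g. "C. because ..." or "(A) the sky".
    m = re.match(r"\(?([A-D])[.):]\s", text)
    if m and m.group(1) in choices:
        return m.group(1)
    return None
```

Because every rule is a deterministic string operation, this stage is cheap to run over an entire prediction file before any LLM call is made.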
Usage
Use for evaluating any MCQ-style benchmark (MMBench, AI2D, MMStar, SEEDBench, ScienceQA, MMMU, etc.). The evaluation is triggered by calling `dataset.evaluate(eval_file)` on an `ImageMCQDataset` instance.
Theoretical Basis
Two-stage answer extraction:
- Heuristic extraction via regex patterns and string matching: fast and deterministic.
- LLM-based extraction as a fallback: handles complex or ambiguous responses.
Accuracy = correct / total per split/category.
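The per-split/per-category aggregation can be sketched as below; the record field names (`category`, `extracted`, `answer`) are illustrative, not VLMEvalKit's actual column names:

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Compute accuracy per category plus an overall figure.

    records: iterable of dicts with keys 'category' (group label),
    'extracted' (the extracted option letter), and 'answer'
    (the ground-truth letter).  Returns {category: accuracy},
    with an extra 'Overall' entry, mirroring the per-split/
    per-category reporting described above.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        for key in (r["category"], "Overall"):
            total[key] += 1
            correct[key] += int(r["extracted"] == r["answer"])
    return {k: correct[k] / total[k] for k in total}
```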
Pseudocode:
for each prediction:
    answer = try_heuristic_extract(prediction)
    if answer is None:
        answer = llm_judge_extract(prediction)
    correct += (answer == ground_truth)
accuracy = correct / total  # per split/category
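The loop above can be made concrete in Python. The LLM judge is stubbed here with a simple substring match against the option texts; in the real pipeline that fallback would prompt GPT-3.5/GPT-4 with the question, the options, and the free-form prediction. All names below are illustrative:

```python
import re

def heuristic_extract(pred):
    # Minimal stand-in for the heuristic stage: a bare option letter,
    # or an explicit "Answer: X" phrase.
    m = re.fullmatch(r"\(?([A-D])\)?\.?", pred.strip()) or \
        re.search(r"[Aa]nswer\s*(?:is)?\s*[:\-]?\s*([A-D])\b", pred)
    return m.group(1) if m else None

def llm_judge_extract(pred, choices):
    # Stub for the LLM fallback: match the prediction against each
    # option's text.  A real judge would be an API call instead.
    for letter, option_text in choices.items():
        if option_text.lower() in pred.lower():
            return letter
    return None

def evaluate(samples):
    """samples: list of (prediction, choices_dict, ground_truth_letter)."""
    correct = 0
    for pred, choices, gt in samples:
        answer = heuristic_extract(pred)
        if answer is None:                             # heuristics failed
            answer = llm_judge_extract(pred, choices)  # fall back to judge
        correct += (answer == gt)
    return correct / len(samples)
```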