Principle: OpenCompass VLMEvalKit MCQ Evaluation
| Field | Value |
|---|---|
| Source | [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) |
| Domain | Vision, Evaluation, NLP |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
An evaluation methodology that extracts and scores multiple-choice answers from VLM predictions using a multi-stage heuristic-then-LLM pipeline.
Description
MCQ evaluation in VLMEvalKit follows a two-stage approach. First, heuristic rules attempt to extract the answer choice (A/B/C/D) from the model's prediction text using pattern matching (regex for option letters, prefix matching against option contents, etc.). When the heuristics fail, an LLM judge (e.g., GPT-3.5/GPT-4) is used as a fallback to interpret the model's free-form response. Results are reported as accuracy per split and per category. The system also supports a verifier mode and circular evaluation (the option order is rotated across repeated queries, and a question counts as correct only if every rotation is answered correctly) for robustness.
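The heuristic stage can be sketched as follows. This is a simplified illustration of the idea, not VLMEvalKit's exact rule set; the function name and the specific patterns are assumptions:

```python
import re

def try_heuristic_extract(prediction, choices=("A", "B", "C", "D")):
    """Try to pull an option letter out of free-form model output.

    Simplified sketch of the heuristic stage: exact match first,
    then an explicit "answer" phrase, then a leading option letter.
    Returns the letter, or None when no rule fires.
    """
    text = prediction.strip()
    # Rule 1: the prediction is exactly an option letter, e.g. "B", "(B)", "B."
    m = re.fullmatch(r"\(?([A-D])\)?\.?", text)
    if m and m.group(1) in choices:
        return m.group(1)
    # Rule 2: an explicit "answer is X" / "Answer: X" phrase.
    m = re.search(r"[Aa]nswer(?:\s+is)?\s*[:\-]?\s*\(?([A-D])\)?", text)
    if m and m.group(1) in choices:
        return m.group(1)
    # Rule 3: a leading option letter, e.g. "C. because ..." or "(A) the sky".
    m = re.match(r"\(?([A-D])[.):]\s", text)
    if m and m.group(1) in choices:
        return m.group(1)
    return None
```

Because every rule is a deterministic string operation, this stage is cheap to run over an entire prediction file before any LLM call is made.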
Usage
Use for evaluating any MCQ-style benchmark (MMBench, AI2D, MMStar, SEEDBench, ScienceQA, MMMU, etc.). The evaluation is triggered by calling `dataset.evaluate(eval_file)` on an `ImageMCQDataset` instance.
Theoretical Basis
Two-stage answer extraction:
- Heuristic extraction via regex patterns and string matching: fast and deterministic.
- LLM-based extraction as a fallback: handles complex or ambiguous responses.
Accuracy = correct / total per split/category.
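The per-split/per-category aggregation can be sketched as below; the record field names (`category`, `extracted`, `answer`) are illustrative, not VLMEvalKit's actual column names:

```python
from collections import defaultdict

def accuracy_by_category(records):
    """Compute accuracy per category plus an overall figure.

    records: iterable of dicts with keys 'category' (group label),
    'extracted' (the extracted option letter), and 'answer'
    (the ground-truth letter).  Returns {category: accuracy},
    with an extra 'Overall' entry, mirroring the per-split/
    per-category reporting described above.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        for key in (r["category"], "Overall"):
            total[key] += 1
            correct[key] += int(r["extracted"] == r["answer"])
    return {k: correct[k] / total[k] for k in total}
```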
Pseudocode:
for each prediction:
    answer = try_heuristic_extract(prediction)
    if answer is None:
        answer = llm_judge_extract(prediction)
    correct += (answer == ground_truth)
accuracy = correct / total  # per split/category
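The loop above can be made concrete in Python. The LLM judge is stubbed here with a simple substring match against the option texts; in the real pipeline that fallback would prompt GPT-3.5/GPT-4 with the question, the options, and the free-form prediction. All names below are illustrative:

```python
import re

def heuristic_extract(pred):
    # Minimal stand-in for the heuristic stage: a bare option letter,
    # or an explicit "Answer: X" phrase.
    m = re.fullmatch(r"\(?([A-D])\)?\.?", pred.strip()) or \
        re.search(r"[Aa]nswer\s*(?:is)?\s*[:\-]?\s*([A-D])\b", pred)
    return m.group(1) if m else None

def llm_judge_extract(pred, choices):
    # Stub for the LLM fallback: match the prediction against each
    # option's text.  A real judge would be an API call instead.
    for letter, option_text in choices.items():
        if option_text.lower() in pred.lower():
            return letter
    return None

def evaluate(samples):
    """samples: list of (prediction, choices_dict, ground_truth_letter)."""
    correct = 0
    for pred, choices, gt in samples:
        answer = heuristic_extract(pred)
        if answer is None:                             # heuristics failed
            answer = llm_judge_extract(pred, choices)  # fall back to judge
        correct += (answer == gt)
    return correct / len(samples)
```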