Implementation: OpenCompass VLMEvalKit ImageMCQDataset.evaluate
| Field | Value |
|---|---|
| Source | VLMEvalKit |
| Domain | Vision, Evaluation, NLP |
Overview
Concrete tool for evaluating VLM predictions on multiple-choice benchmarks using heuristic and LLM-based answer extraction provided by VLMEvalKit.
Description
ImageMCQDataset.evaluate() in vlmeval/dataset/image_mcq.py dispatches to either evaluate_heuristic() (the default) or evaluate_verifier() (when use_verifier=True). The heuristic path uses extract_answer_from_item() from vlmeval/dataset/utils/multiple_choice.py, which applies regex-based answer extraction with an LLM fallback via build_judge(). Accuracy is computed per split/category and saved to an _acc.csv file.
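The heuristic extraction step can be illustrated with a simplified sketch. This is not the actual extract_answer_from_item() implementation — the real code handles many more answer formats and, on failure, falls back to an LLM judge built via build_judge() — but it shows the regex-then-match pattern:

```python
import re
import string
from typing import Optional


def extract_choice_heuristic(prediction: str, options: dict) -> Optional[str]:
    """Simplified sketch of regex-based MCQ answer extraction.

    `options` maps letters to option text, e.g. {"A": "cat", "B": "dog"}.
    Returns the matched letter, or None (at which point the real
    pipeline would fall back to the LLM judge).
    """
    pred = prediction.strip()
    letters = [c for c in string.ascii_uppercase if c in options]
    # Case 1: prediction starts with a bare letter, e.g. "B", "(C)", "A.".
    m = re.match(r"^\(?([A-Z])\)?[\.\:\)]?\s", pred + " ")
    if m and m.group(1) in letters:
        return m.group(1)
    # Case 2: phrasings like "The answer is B" / "Answer: B".
    m = re.search(r"[Aa]nswer\s*(?:is)?\s*:?\s*\(?([A-Z])\)?", pred)
    if m and m.group(1) in letters:
        return m.group(1)
    # Case 3: prediction repeats an option's text verbatim.
    for letter in letters:
        if options[letter].lower() == pred.lower():
            return letter
    return None  # real pipeline would invoke the LLM judge here
```

The LLM fallback exists precisely because open-ended VLM outputs often match none of these patterns; the judge is asked to map the free-form response to one of the option letters.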
Usage
Called after inference completes. Requires the prediction file produced by the inference step. Optionally requires a judge LLM API key for the LLM-based extraction fallback.
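When the LLM fallback is needed, judge credentials are typically supplied via environment variables (VLMEvalKit reads them from the environment or a .env file; the exact variables depend on your judge backend — the names below assume an OpenAI-compatible judge):

```shell
# Credentials for the judge LLM used by the extraction fallback.
# Assumes an OpenAI-compatible endpoint; adjust for your backend.
export OPENAI_API_KEY="sk-..."
```

If no key is configured, the heuristic path still runs; only predictions that regex extraction cannot resolve are affected.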
Code Reference
- Source: vlmeval/dataset/image_mcq.py, Lines: L236-240 (evaluate entry), L42-465 (full class)
- Also: vlmeval/dataset/utils/multiple_choice.py, Lines: L350-499 (answer extraction)
- Signature:
```python
def evaluate(self, eval_file: str, **judge_kwargs) -> Union[pd.DataFrame, dict]:
    """
    Args:
        eval_file: Path to predictions file (xlsx/csv/tsv).
        **judge_kwargs: Keyword arguments including:
            - model (str): Judge model name (e.g., "chatgpt-0125")
            - nproc (int): Parallel judge calls
            - use_verifier (bool): Use verifier mode instead of heuristic
    Returns:
        DataFrame with accuracy by split/category, or dict with scores.
    """
```
- Import (method on the ImageMCQDataset class):

```python
from vlmeval.dataset import ImageMCQDataset
```
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | eval_file | str | Path to prediction file with columns: index, prediction, answer, A, B, C, D |
| Input | judge_kwargs | dict | LLM judge config (model name, nproc, use_verifier) |
| Output | results | DataFrame | Accuracy per split/category |
| Side Effect | _acc.csv | file | Saves accuracy results to disk |
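For reference, a minimal prediction file satisfying the input contract can be built with pandas. The column names come from the table above; the index values and model outputs are purely illustrative:

```python
import pandas as pd

# Minimal prediction file matching the I/O contract above.
# Each row: question index, the four options, the ground-truth
# answer letter, and the model's raw prediction string.
df = pd.DataFrame({
    "index": [0, 1],
    "A": ["cat", "red"],
    "B": ["dog", "blue"],
    "C": ["bird", "green"],
    "D": ["fish", "yellow"],
    "answer": ["B", "C"],
    "prediction": ["The answer is B.", "green"],
})
df.to_csv("demo_predictions.csv", index=False)  # xlsx/tsv also accepted
```

In practice this file is written by the inference step; building one by hand is mainly useful for smoke-testing the evaluation path.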
Usage Examples
```python
from vlmeval.dataset import build_dataset

dataset = build_dataset("MMBench_DEV_EN_V11")

# After inference produces the prediction file:
results = dataset.evaluate(
    eval_file="./results/InternVL2-8B_MMBench_DEV_EN_V11.xlsx",
    model="chatgpt-0125",
    nproc=4,
)
print(results)  # DataFrame with accuracy by split/category
```