Implementation: OpenCompass VLMEvalKit ImageMCQDataset.evaluate
| Field | Value |
|---|---|
| Source | VLMEvalKit |
| Domain | Vision, Evaluation, NLP |
Overview
Concrete tool for evaluating VLM predictions on multiple-choice benchmarks using heuristic and LLM-based answer extraction provided by VLMEvalKit.
Description
ImageMCQDataset.evaluate() in vlmeval/dataset/image_mcq.py dispatches to either evaluate_heuristic() (the default) or evaluate_verifier() (when use_verifier=True). The heuristic path uses extract_answer_from_item() from vlmeval/dataset/utils/multiple_choice.py, which applies regex-based answer extraction with an LLM fallback via build_judge(). Accuracy is computed per split/category and saved to an _acc.csv file.
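The heuristic extraction step can be illustrated with a simplified sketch. This is not the actual extract_answer_from_item() implementation — the real code handles many more answer formats and, on failure, falls back to an LLM judge built via build_judge() — but it shows the regex-then-match pattern:

```python
import re
import string
from typing import Optional


def extract_choice_heuristic(prediction: str, options: dict) -> Optional[str]:
    """Simplified sketch of regex-based MCQ answer extraction.

    `options` maps letters to option text, e.g. {"A": "cat", "B": "dog"}.
    Returns the matched letter, or None (at which point the real
    pipeline would fall back to the LLM judge).
    """
    pred = prediction.strip()
    letters = [c for c in string.ascii_uppercase if c in options]
    # Case 1: prediction starts with a bare letter, e.g. "B", "(C)", "A.".
    m = re.match(r"^\(?([A-Z])\)?[\.\:\)]?\s", pred + " ")
    if m and m.group(1) in letters:
        return m.group(1)
    # Case 2: phrasings like "The answer is B" / "Answer: B".
    m = re.search(r"[Aa]nswer\s*(?:is)?\s*:?\s*\(?([A-Z])\)?", pred)
    if m and m.group(1) in letters:
        return m.group(1)
    # Case 3: prediction repeats an option's text verbatim.
    for letter in letters:
        if options[letter].lower() == pred.lower():
            return letter
    return None  # real pipeline would invoke the LLM judge here
```

The LLM fallback exists precisely because open-ended VLM outputs often match none of these patterns; the judge is asked to map the free-form response to one of the option letters.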
Usage
Called after inference completes. Requires the prediction file produced by the inference step. Optionally requires a judge LLM API key for the LLM-based extraction fallback.
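When the LLM fallback is needed, judge credentials are typically supplied via environment variables (VLMEvalKit reads them from the environment or a .env file; the exact variables depend on your judge backend — the names below assume an OpenAI-compatible judge):

```shell
# Credentials for the judge LLM used by the extraction fallback.
# Assumes an OpenAI-compatible endpoint; adjust for your backend.
export OPENAI_API_KEY="sk-..."
```

If no key is configured, the heuristic path still runs; only predictions that regex extraction cannot resolve are affected.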
Code Reference
- Source: vlmeval/dataset/image_mcq.py, Lines: L236-240 (evaluate entry), L42-465 (full class)
- Also: vlmeval/dataset/utils/multiple_choice.py, Lines: L350-499 (answer extraction)
- Signature:
```python
def evaluate(self, eval_file: str, **judge_kwargs) -> Union[pd.DataFrame, dict]:
    """
    Args:
        eval_file: Path to predictions file (xlsx/csv/tsv).
        **judge_kwargs: Keyword arguments including:
            - model (str): Judge model name (e.g., "chatgpt-0125")
            - nproc (int): Parallel judge calls
            - use_verifier (bool): Use verifier mode instead of heuristic
    Returns:
        DataFrame with accuracy by split/category, or dict with scores.
    """
```
- Import (method on the ImageMCQDataset class):

```python
from vlmeval.dataset import ImageMCQDataset
```
I/O Contract
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | eval_file | str | Path to prediction file with columns: index, prediction, answer, A, B, C, D |
| Input | judge_kwargs | dict | LLM judge config (model name, nproc, use_verifier) |
| Output | results | DataFrame | Accuracy per split/category |
| Side Effect | _acc.csv | file | Saves accuracy results to disk |
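For reference, a minimal prediction file satisfying the input contract can be built with pandas. The column names come from the table above; the index values and model outputs are purely illustrative:

```python
import pandas as pd

# Minimal prediction file matching the I/O contract above.
# Each row: question index, the four options, the ground-truth
# answer letter, and the model's raw prediction string.
df = pd.DataFrame({
    "index": [0, 1],
    "A": ["cat", "red"],
    "B": ["dog", "blue"],
    "C": ["bird", "green"],
    "D": ["fish", "yellow"],
    "answer": ["B", "C"],
    "prediction": ["The answer is B.", "green"],
})
df.to_csv("demo_predictions.csv", index=False)  # xlsx/tsv also accepted
```

In practice this file is written by the inference step; building one by hand is mainly useful for smoke-testing the evaluation path.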
Usage Examples
```python
from vlmeval.dataset import build_dataset

dataset = build_dataset("MMBench_DEV_EN_V11")

# After inference produces the prediction file:
results = dataset.evaluate(
    eval_file="./results/InternVL2-8B_MMBench_DEV_EN_V11.xlsx",
    model="chatgpt-0125",
    nproc=4,
)
print(results)  # DataFrame with accuracy by split/category
```