Heuristic: OpenCompass VLMEvalKit Judge Model Selection by Dataset
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation |
| Last Updated | 2026-02-14 01:30 GMT |
## Overview
Decision framework for selecting the appropriate LLM judge model for each benchmark dataset type, balancing evaluation quality against API cost.
## Description
VLMEvalKit uses LLM-as-judge for evaluating open-ended and subjective responses. The judge model is selected automatically based on the dataset name and type. MCQ and Y/N datasets use the cheapest judge (`chatgpt-0125`) since answer extraction is relatively simple. Open-ended benchmarks requiring deep understanding use more expensive models (`gpt-4o`, `gpt-4-turbo`). Math benchmarks use `gpt-4o-mini` for answer equivalence checking. The judge selection can be overridden with `--judge` on the CLI.
## Usage
Apply this heuristic when configuring evaluation pipelines. The default judge selection in `run.py:365-407` is carefully tuned per benchmark. Override with `--judge` only when you have a specific reason (e.g., using a local judge model).
## The Insight (Rule of Thumb)
- MCQ / Y/N datasets: Use `chatgpt-0125` (cheap, sufficient for option extraction)
- MMVet, LLaVABench, MMBench-Video: Use `gpt-4-turbo` (requires nuanced scoring)
- MathVista, MathVerse, DynaMath, LogicVista: Use `gpt-4o-mini` (math answer extraction)
- MMLongBench, MMDU, MIA-Bench, WildVision, MM-IFEval: Use `gpt-4o` (complex open-ended)
- VDC: Use `llama31-8b` (video detailed captioning)
- Video-MMLU: Use `qwen-72b` (knowledge-intensive video QA)
- WeMath, MME-Reasoning: Use `gpt-4o-mini` (reasoning verification)
- VisuLogic: Use `exact_matching` (no LLM judge needed)
- Trade-off: Cheaper judges may miss nuance; expensive judges increase API costs significantly for large-scale evaluations.
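The rules above can be sketched as a plain lookup function. Names like `pick_judge` and `JUDGE_RULES` are illustrative assumptions, as is the final fallback; the real logic lives in `run.py` and matches on the dataset's `TYPE` plus name substrings:

```python
# Illustrative encoding of the rule-of-thumb table: first keyword match wins,
# mirroring the precedence in run.py (WeMath/MME-Reasoning checked first).
JUDGE_RULES = [
    (('WeMath', 'MME-Reasoning'), 'gpt-4o-mini'),
    (('VisuLogic',), 'exact_matching'),
    (('MMVet', 'LLaVABench', 'MMBench-Video'), 'gpt-4-turbo'),
    (('MathVista', 'MathVerse', 'DynaMath', 'LogicVista'), 'gpt-4o-mini'),
    (('MMLongBench', 'MMDU', 'MIA-Bench', 'WildVision', 'MM-IFEval'), 'gpt-4o'),
    (('VDC',), 'llama31-8b'),
    (('Video-MMLU',), 'qwen-72b'),
]

def pick_judge(dataset_name: str, dataset_type: str) -> str:
    """Return the judge model name for a benchmark (hypothetical helper)."""
    for keywords, model in JUDGE_RULES:
        if any(k in dataset_name for k in keywords):
            return model
    if dataset_type in ('MCQ', 'Y/N'):
        return 'chatgpt-0125'  # cheap judge suffices for option extraction
    return 'gpt-4o-mini'  # assumed fallback; run.py's actual default may differ
```

For example, `pick_judge('MMVet', 'VQA')` returns `'gpt-4-turbo'`, while an unlisted MCQ benchmark falls through to `'chatgpt-0125'`.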
## Reasoning
The judge model must match the evaluation complexity. MCQ extraction is a pattern-matching task that any capable model can handle cheaply. Open-ended evaluation requires understanding context, following rubrics, and making subjective judgments, necessitating stronger (and more expensive) models. Math answer checking needs symbolic reasoning but not creative judgment, making mid-tier models appropriate. The per-benchmark tuning reflects empirical experience with judge accuracy across dozens of benchmarks.
## Code Evidence
Judge selection logic from `run.py:365-407`:
```python
if dataset.TYPE in ['MCQ', 'Y/N', 'MCQ_MMMU_Pro'] or listinstr(
    ['moviechat1k', 'mme-reasoning'], dataset_name.lower()
):
    if listinstr(['WeMath', 'MME-Reasoning'], dataset_name):
        judge_kwargs['model'] = 'gpt-4o-mini'
    elif listinstr(['VisuLogic'], dataset_name):
        judge_kwargs['model'] = 'exact_matching'
    else:
        judge_kwargs['model'] = 'chatgpt-0125'
elif listinstr(['MMVet', 'LLaVABench', 'MMBench_Video'], dataset_name):
    if listinstr(['LLaVABench_KO'], dataset_name):
        judge_kwargs['model'] = 'gpt-4o-0806'
    else:
        judge_kwargs['model'] = 'gpt-4-turbo'
elif listinstr(['MathVista', 'MathVerse', 'MathVision', 'LENS', 'DynaMath',
                'VL-RewardBench', 'LogicVista', 'MOAT', 'OCR_Reasoning'], dataset_name):
    judge_kwargs['model'] = 'gpt-4o-mini'
elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'SLIDEVQA', 'MIA-Bench',
                'WildVision', 'MMAlignBench', 'MM-IFEval'], dataset_name):
    judge_kwargs['model'] = 'gpt-4o'
```
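The snippet above leans on VLMEvalKit's `listinstr` substring helper. A minimal re-implementation, consistent with how it is used here (and hedged as a sketch rather than the library's exact source):

```python
def listinstr(lst, s):
    """Return True if any element of `lst` occurs as a substring of `s`."""
    return any(item in s for item in lst)

# Substring matching means versioned dataset names still hit their rule.
assert listinstr(['MMVet', 'LLaVABench'], 'MMVet_v2')
assert not listinstr(['MathVista'], 'MMBench_DEV_EN')
```

Note that matching is case-sensitive except where the caller lowercases the name first (as with `dataset_name.lower()` above).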
Default judge configuration from `run.py:352-357`:
```python
judge_kwargs = {
    'nproc': args.api_nproc,
    'verbose': args.verbose,
    'retry': args.retry if args.retry is not None else 3,
    **(json.loads(args.judge_args) if args.judge_args else {}),
}
```
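Because the `judge_args` JSON is unpacked last in the dict literal, user-supplied keys override the defaults. A small sketch of that precedence (the CLI value shown is hypothetical):

```python
import json

# Defaults first, then the judge_args JSON merged last: later keys win.
args_judge_args = '{"retry": 5, "temperature": 0}'  # hypothetical CLI input
judge_kwargs = {
    'nproc': 4,
    'verbose': False,
    'retry': 3,
    **(json.loads(args_judge_args) if args_judge_args else {}),
}

assert judge_kwargs['retry'] == 5        # default 3 overridden by judge_args
assert judge_kwargs['nproc'] == 4        # untouched default retained
assert judge_kwargs['temperature'] == 0  # extra key passed through to the judge
```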