Heuristic: OpenCompass VLMEvalKit Judge Model Selection by Dataset
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation |
| Last Updated | 2026-02-14 01:30 GMT |
## Overview
Decision framework for selecting the appropriate LLM judge model for each benchmark dataset type, balancing evaluation quality against API cost.
## Description
VLMEvalKit uses LLM-as-judge for evaluating open-ended and subjective responses. The judge model is selected automatically based on the dataset name and type. MCQ and Y/N datasets use the cheapest judge (`chatgpt-0125`) since answer extraction is relatively simple. Open-ended benchmarks requiring deep understanding use more expensive models (`gpt-4o`, `gpt-4-turbo`). Math benchmarks use `gpt-4o-mini` for answer equivalence checking. The judge selection can be overridden with `--judge` on the CLI.
## Usage
Apply this heuristic when configuring evaluation pipelines. The default judge selection in `run.py:365-407` is carefully tuned per benchmark. Override with `--judge` only when you have a specific reason (e.g., using a local judge model).
## The Insight (Rule of Thumb)
- MCQ / Y/N datasets: Use `chatgpt-0125` (cheap, sufficient for option extraction)
- MMVet, LLaVABench, MMBench-Video: Use `gpt-4-turbo` (requires nuanced scoring)
- MathVista, MathVerse, DynaMath, LogicVista: Use `gpt-4o-mini` (math answer extraction)
- MMLongBench, MMDU, MIA-Bench, WildVision, MM-IFEval: Use `gpt-4o` (complex open-ended)
- VDC: Use `llama31-8b` (video detailed captioning)
- Video-MMLU: Use `qwen-72b` (knowledge-intensive video QA)
- WeMath, MME-Reasoning: Use `gpt-4o-mini` (reasoning verification)
- VisuLogic: Use `exact_matching` (no LLM judge needed)
- Trade-off: Cheaper judges may miss nuance; expensive judges increase API costs significantly for large-scale evaluations.
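The rules above can be sketched as a plain lookup function. Names like `pick_judge` and `JUDGE_RULES` are illustrative assumptions, as is the final fallback; the real logic lives in `run.py` and matches on the dataset's `TYPE` plus name substrings:

```python
# Illustrative encoding of the rule-of-thumb table: first keyword match wins,
# mirroring the precedence in run.py (WeMath/MME-Reasoning checked first).
JUDGE_RULES = [
    (('WeMath', 'MME-Reasoning'), 'gpt-4o-mini'),
    (('VisuLogic',), 'exact_matching'),
    (('MMVet', 'LLaVABench', 'MMBench-Video'), 'gpt-4-turbo'),
    (('MathVista', 'MathVerse', 'DynaMath', 'LogicVista'), 'gpt-4o-mini'),
    (('MMLongBench', 'MMDU', 'MIA-Bench', 'WildVision', 'MM-IFEval'), 'gpt-4o'),
    (('VDC',), 'llama31-8b'),
    (('Video-MMLU',), 'qwen-72b'),
]

def pick_judge(dataset_name: str, dataset_type: str) -> str:
    """Return the judge model name for a benchmark (hypothetical helper)."""
    for keywords, model in JUDGE_RULES:
        if any(k in dataset_name for k in keywords):
            return model
    if dataset_type in ('MCQ', 'Y/N'):
        return 'chatgpt-0125'  # cheap judge suffices for option extraction
    return 'gpt-4o-mini'  # assumed fallback; run.py's actual default may differ
```

For example, `pick_judge('MMVet', 'VQA')` returns `'gpt-4-turbo'`, while an unlisted MCQ benchmark falls through to `'chatgpt-0125'`.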
## Reasoning
The judge model must match the evaluation complexity. MCQ extraction is a pattern-matching task that any capable model can handle cheaply. Open-ended evaluation requires understanding context, following rubrics, and making subjective judgments, necessitating stronger (and more expensive) models. Math answer checking needs symbolic reasoning but not creative judgment, making mid-tier models appropriate. The per-benchmark tuning reflects empirical experience with judge accuracy across dozens of benchmarks.
## Code Evidence
Judge selection logic from `run.py:365-407`:
```python
if dataset.TYPE in ['MCQ', 'Y/N', 'MCQ_MMMU_Pro'] or listinstr(
    ['moviechat1k', 'mme-reasoning'], dataset_name.lower()
):
    if listinstr(['WeMath', 'MME-Reasoning'], dataset_name):
        judge_kwargs['model'] = 'gpt-4o-mini'
    elif listinstr(['VisuLogic'], dataset_name):
        judge_kwargs['model'] = 'exact_matching'
    else:
        judge_kwargs['model'] = 'chatgpt-0125'
elif listinstr(['MMVet', 'LLaVABench', 'MMBench_Video'], dataset_name):
    if listinstr(['LLaVABench_KO'], dataset_name):
        judge_kwargs['model'] = 'gpt-4o-0806'
    else:
        judge_kwargs['model'] = 'gpt-4-turbo'
elif listinstr(['MathVista', 'MathVerse', 'MathVision', 'LENS', 'DynaMath',
                'VL-RewardBench', 'LogicVista', 'MOAT', 'OCR_Reasoning'], dataset_name):
    judge_kwargs['model'] = 'gpt-4o-mini'
elif listinstr(['MMLongBench', 'MMDU', 'DUDE', 'SLIDEVQA', 'MIA-Bench',
                'WildVision', 'MMAlignBench', 'MM-IFEval'], dataset_name):
    judge_kwargs['model'] = 'gpt-4o'
```
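The snippet above leans on VLMEvalKit's `listinstr` substring helper. A minimal re-implementation, consistent with how it is used here (and hedged as a sketch rather than the library's exact source):

```python
def listinstr(lst, s):
    """Return True if any element of `lst` occurs as a substring of `s`."""
    return any(item in s for item in lst)

# Substring matching means versioned dataset names still hit their rule.
assert listinstr(['MMVet', 'LLaVABench'], 'MMVet_v2')
assert not listinstr(['MathVista'], 'MMBench_DEV_EN')
```

Note that matching is case-sensitive except where the caller lowercases the name first (as with `dataset_name.lower()` above).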
Default judge configuration from `run.py:352-357`:
```python
judge_kwargs = {
    'nproc': args.api_nproc,
    'verbose': args.verbose,
    'retry': args.retry if args.retry is not None else 3,
    **(json.loads(args.judge_args) if args.judge_args else {}),
}
```
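Because the `judge_args` JSON is unpacked last in the dict literal, user-supplied keys override the defaults. A small sketch of that precedence (the CLI value shown is hypothetical):

```python
import json

# Defaults first, then the judge_args JSON merged last: later keys win.
args_judge_args = '{"retry": 5, "temperature": 0}'  # hypothetical CLI input
judge_kwargs = {
    'nproc': 4,
    'verbose': False,
    'retry': 3,
    **(json.loads(args_judge_args) if args_judge_args else {}),
}

assert judge_kwargs['retry'] == 5        # default 3 overridden by judge_args
assert judge_kwargs['nproc'] == 4        # untouched default retained
assert judge_kwargs['temperature'] == 0  # extra key passed through to the judge
```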