Principle: OpenCompass VLMEvalKit Judge LLM Construction
| Field | Value |
|---|---|
| Source | Repo |
| Domain | Vision, Evaluation, NLP |
Overview
A factory pattern that constructs LLM judge models for evaluating open-ended VLM predictions by mapping shorthand judge names to API wrappers.
Description
Many VLM evaluation tasks require a secondary LLM to judge the quality of model predictions (e.g., scoring open-ended VQA answers or extracting answers from free-form text). VLMEvalKit provides build_judge(), which maps shorthand judge names (e.g., "chatgpt-0125", "gpt-4o") to specific, pinned model versions and wraps them in API client classes (OpenAIWrapper, SiliconFlowAPI, HFChatModel). The resulting judge model is used for tasks such as MCQ answer-extraction fallback, VQA scoring, and multi-dimensional evaluation.
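The factory idea can be sketched as follows. This is a minimal, hypothetical illustration: the map contents, wrapper class, and the exact build_judge() signature are assumptions for the sketch, not the actual VLMEvalKit source.

```python
# Hypothetical sketch of a build_judge()-style factory.
# Version strings below are illustrative, not VLMEvalKit's actual pins.

# Versioned judge map: shorthand name -> pinned model version (reproducibility)
JUDGE_VERSION_MAP = {
    "chatgpt-0125": "gpt-3.5-turbo-0125",
    "gpt-4o": "gpt-4o-2024-05-13",
}

class OpenAIWrapper:
    """Minimal stand-in for an API client wrapper class."""
    def __init__(self, model: str, **kwargs):
        self.model = model
        self.kwargs = kwargs

def build_judge(model: str = "chatgpt-0125", **kwargs) -> OpenAIWrapper:
    """Resolve a shorthand judge name to a pinned version and wrap it."""
    version = JUDGE_VERSION_MAP.get(model, model)  # fall back to the raw name
    return OpenAIWrapper(model=version, **kwargs)

judge = build_judge("chatgpt-0125", temperature=0.0)
print(judge.model)  # the resolved, pinned model version
```

Pinning shorthand names to explicit versions is what keeps judge-based scores comparable across runs.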
Usage
Use whenever an evaluation requires LLM-based judgment; it is called internally by dataset.evaluate() methods. Requires API keys to be configured via load_env().
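A minimal sketch of the env-loading step, assuming a load_env() that reads KEY=VALUE pairs from a .env file into the process environment (the file format and function behavior here are assumptions, not VLMEvalKit's exact implementation):

```python
import os

def load_env(path: str = ".env") -> None:
    """Hypothetical loader: read KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return  # nothing to load; API keys may already be set externally
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, val = line.partition("=")
                # Do not clobber keys already set in the environment
                os.environ.setdefault(key.strip(), val.strip())

load_env()  # e.g. makes OPENAI_API_KEY available before building the judge
```

The judge construction fails fast if the relevant API key is missing, so loading keys should happen before dataset.evaluate() is invoked.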
Theoretical Basis
The LLM-as-judge paradigm uses a strong language model to evaluate the outputs of another model. The judge maps are versioned for reproducibility, so the same shorthand name always resolves to the same model. The factory pattern abstracts provider differences, allowing seamless switching between OpenAI, SiliconFlow, and local Hugging Face models through a unified interface.
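The provider abstraction can be sketched with a shared base class: evaluation code depends only on a common generate() interface, so any provider-specific wrapper can be swapped in. Class and method names here are illustrative assumptions, not the actual VLMEvalKit classes.

```python
from abc import ABC, abstractmethod

class BaseJudge(ABC):
    """Unified interface every provider wrapper implements (assumed design)."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class OpenAIJudge(BaseJudge):
    def generate(self, prompt: str) -> str:
        return f"[openai] judged: {prompt}"  # stub in place of a real API call

class LocalHFJudge(BaseJudge):
    def generate(self, prompt: str) -> str:
        return f"[hf-local] judged: {prompt}"  # stub in place of local inference

def score_prediction(judge: BaseJudge, prediction: str) -> str:
    # Evaluation logic is written once, against the shared interface.
    return judge.generate(f"Rate this answer: {prediction}")
```

Because dataset evaluation code only calls generate(), switching from a hosted API judge to a local Hugging Face judge requires no changes beyond the factory call.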