Principle: OpenCompass VLMEvalKit Judge LLM Construction
| Field | Value |
|---|---|
| Source | Repo |
| Domain | Vision, Evaluation, NLP |
Overview
A factory pattern that constructs LLM judge models for evaluating open-ended VLM predictions by mapping shorthand judge names to API wrappers.
Description
Many VLM evaluation tasks require a secondary LLM to judge the quality of model predictions (e.g., scoring open-ended VQA answers or extracting answers from free-form text). VLMEvalKit provides build_judge(), which maps shorthand judge names (e.g., "chatgpt-0125", "gpt-4o") to specific, pinned model versions and wraps them in API client classes (OpenAIWrapper, SiliconFlowAPI, HFChatModel). The resulting judge model is used for tasks such as MCQ answer-extraction fallback, VQA scoring, and multi-dimensional evaluation.
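The factory idea can be sketched as follows. This is a minimal, hypothetical illustration: the map contents, wrapper class, and the exact build_judge() signature are assumptions for the sketch, not the actual VLMEvalKit source.

```python
# Hypothetical sketch of a build_judge()-style factory.
# Version strings below are illustrative, not VLMEvalKit's actual pins.

# Versioned judge map: shorthand name -> pinned model version (reproducibility)
JUDGE_VERSION_MAP = {
    "chatgpt-0125": "gpt-3.5-turbo-0125",
    "gpt-4o": "gpt-4o-2024-05-13",
}

class OpenAIWrapper:
    """Minimal stand-in for an API client wrapper class."""
    def __init__(self, model: str, **kwargs):
        self.model = model
        self.kwargs = kwargs

def build_judge(model: str = "chatgpt-0125", **kwargs) -> OpenAIWrapper:
    """Resolve a shorthand judge name to a pinned version and wrap it."""
    version = JUDGE_VERSION_MAP.get(model, model)  # fall back to the raw name
    return OpenAIWrapper(model=version, **kwargs)

judge = build_judge("chatgpt-0125", temperature=0.0)
print(judge.model)  # the resolved, pinned model version
```

Pinning shorthand names to explicit versions is what keeps judge-based scores comparable across runs.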
Usage
Use whenever an evaluation requires LLM-based judgment; it is called internally by dataset.evaluate() methods. Requires API keys to be configured via load_env().
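A minimal sketch of the env-loading step, assuming a load_env() that reads KEY=VALUE pairs from a .env file into the process environment (the file format and function behavior here are assumptions, not VLMEvalKit's exact implementation):

```python
import os

def load_env(path: str = ".env") -> None:
    """Hypothetical loader: read KEY=VALUE lines into os.environ."""
    if not os.path.exists(path):
        return  # nothing to load; API keys may already be set externally
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, val = line.partition("=")
                # Do not clobber keys already set in the environment
                os.environ.setdefault(key.strip(), val.strip())

load_env()  # e.g. makes OPENAI_API_KEY available before building the judge
```

The judge construction fails fast if the relevant API key is missing, so loading keys should happen before dataset.evaluate() is invoked.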
Theoretical Basis
The LLM-as-judge paradigm uses a strong language model to evaluate the outputs of another model. The judge maps are versioned for reproducibility, so the same shorthand name always resolves to the same model. The factory pattern abstracts provider differences, allowing seamless switching between OpenAI, SiliconFlow, and local Hugging Face models through a unified interface.
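The provider abstraction can be sketched with a shared base class: evaluation code depends only on a common generate() interface, so any provider-specific wrapper can be swapped in. Class and method names here are illustrative assumptions, not the actual VLMEvalKit classes.

```python
from abc import ABC, abstractmethod

class BaseJudge(ABC):
    """Unified interface every provider wrapper implements (assumed design)."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class OpenAIJudge(BaseJudge):
    def generate(self, prompt: str) -> str:
        return f"[openai] judged: {prompt}"  # stub in place of a real API call

class LocalHFJudge(BaseJudge):
    def generate(self, prompt: str) -> str:
        return f"[hf-local] judged: {prompt}"  # stub in place of local inference

def score_prediction(judge: BaseJudge, prediction: str) -> str:
    # Evaluation logic is written once, against the shared interface.
    return judge.generate(f"Rate this answer: {prediction}")
```

Because dataset evaluation code only calls generate(), switching from a hosted API judge to a local Hugging Face judge requires no changes beyond the factory call.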