Principle:OpenGVLab InternVL GPT Based Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_as_Judge, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
GPT-Based Evaluation uses a large language model (typically GPT-4) as an automated judge to score and compare model-generated answers against reference responses on question-answering and visual understanding benchmarks.
Description
This principle covers using GPT-4 as an automated evaluator for open-ended model responses, where exact-match metrics are insufficient. The evaluation pipeline sends a structured prompt to the GPT-4 API containing the original question, two answers (one from the model under evaluation and one reference), and category-specific evaluation rules. GPT-4 returns a pair of numerical scores, one per answer, which the pipeline parses from the first line of the response.
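The prompt-and-parse step described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `build_prompt` and `parse_score_pair`, the bracketed prompt layout, the `role` and `prompt` rule keys, and the `[-1.0, -1.0]` failure sentinel are all assumptions made for the example.

```python
from typing import Dict, List


def build_prompt(question: str, answer_1: str, answer_2: str,
                 rule: Dict[str, str]) -> str:
    """Assemble a judge prompt from the question, two answers, and a rule.

    The layout and the 'role'/'prompt' keys are illustrative assumptions.
    """
    role = rule["role"]
    return (
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer_1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer_2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n"
    )


def parse_score_pair(review: str) -> List[float]:
    """Parse two scores from the first line of the judge's reply.

    Accepts space- or comma-separated numbers. Returns [-1.0, -1.0]
    (an assumed sentinel) when the first line is not exactly two numbers.
    """
    lines = review.strip().splitlines()
    first_line = lines[0] if lines else ""
    tokens = first_line.replace(",", " ").split()
    if len(tokens) != 2:
        return [-1.0, -1.0]
    try:
        return [float(tokens[0]), float(tokens[1])]
    except ValueError:
        return [-1.0, -1.0]
```

Returning a sentinel instead of raising lets a long evaluation run continue past a malformed reply; the affected entries can then be filtered out or re-queried during aggregation.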
The approach supports category-specific evaluation rules loaded from a JSON rule file, enabling different scoring criteria for different question types (e.g., conversation, detail, complex reasoning). Results are written to JSONL format for downstream aggregation.
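A rough sketch of rule selection and JSONL output is shown below. The rule-file schema (a JSON object keyed by category, with a `default` fallback entry), the helper names, and the record fields are assumptions for illustration; the actual rule file may be organized differently.

```python
import json
from typing import Any, Dict


def load_rules(path: str) -> Dict[str, Dict[str, str]]:
    """Load category-specific rules from a JSON file (assumed schema)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def select_rule(rules: Dict[str, Dict[str, str]], category: str,
                default: str = "default") -> Dict[str, str]:
    """Return the rule for a category, falling back to an assumed default key."""
    if category in rules:
        return rules[category]
    if default in rules:
        return rules[default]
    raise KeyError(f"no rule for category {category!r} and no {default!r} rule")


def append_result(path: str, record: Dict[str, Any]) -> None:
    """Append one evaluation record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one record per line means a partially completed run still leaves a valid prefix on disk, which keeps later aggregation simple.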
Key design aspects include:
- Retry logic that waits and retries when the OpenAI API rate-limits requests
- Resume capability by checking existing output files and skipping already-evaluated entries
- Context enrichment with visual descriptions (captions, bounding boxes) for visual QA tasks
- Parallel execution via Ray for large-scale evaluations
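The retry and resume aspects listed above might look roughly like the sketch below. It uses only the standard library: `RateLimited` is a hypothetical stand-in for whatever rate-limit exception the API client raises, and the exponential backoff schedule, the `question_id` key, and the function names are assumptions. Ray-based parallelism and context enrichment are not shown.

```python
import json
import os
import time
from typing import Callable, Set


class RateLimited(Exception):
    """Hypothetical stand-in for the API client's rate-limit error."""


def call_with_retry(request_fn: Callable[[], str], max_attempts: int = 5,
                    base_delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Call request_fn, backing off exponentially when rate-limited."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * (2 ** attempt))
    raise ValueError("max_attempts must be at least 1")


def load_done_ids(output_path: str, id_key: str = "question_id") -> Set:
    """Collect IDs already present in the output JSONL so they can be skipped."""
    done: Set = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)[id_key])
            except (json.JSONDecodeError, KeyError, TypeError):
                continue  # ignore truncated or malformed lines
    return done
```

A driver loop would call `load_done_ids` once at startup and skip any question whose ID is in the returned set, so an interrupted run resumes without paying for duplicate API calls.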
Usage
Use this principle when implementing automated evaluation of open-ended model responses on benchmarks like LLaVA-Bench, where human-like judgment is needed to assess answer quality beyond simple string matching.
Theoretical Basis
Using GPT-4 as a judge follows the "LLM-as-a-Judge" paradigm, as applied in the LLaVA evaluation methodology. Prior work reports that strong LLMs can produce evaluations that correlate well with human judgments on open-ended QA tasks. Scoring a candidate and a reference answer in the same prompt yields a relative quality assessment, and the structured prompt is intended to reduce position bias, although answer order can still influence the judge.
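One way to turn per-question score pairs into a relative quality figure is to express the candidate's total score as a percentage of the reference's total. The sketch below assumes that convention, assumes each pair is ordered `(candidate, reference)`, and assumes failed parses are marked with negative scores; none of these details are confirmed by the text above.

```python
from typing import Iterable, Tuple


def relative_score(pairs: Iterable[Tuple[float, float]]) -> float:
    """Return candidate score as a percentage of reference score.

    Each pair is assumed to be (candidate, reference). Pairs containing a
    negative value (an assumed marker for a failed parse) are skipped.
    """
    valid = [(c, r) for c, r in pairs if c >= 0 and r >= 0]
    total_reference = sum(r for _, r in valid)
    if total_reference == 0:
        raise ValueError("no valid reference scores to compare against")
    total_candidate = sum(c for c, _ in valid)
    return 100.0 * total_candidate / total_reference
```

A value above 100 would mean the judge rated the candidate higher than the reference on aggregate; because the ratio is computed over totals, questions with higher reference scores carry more weight.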