Principle:OpenGVLab InternVL GPT Based Evaluation

Knowledge Sources	OpenGVLab_InternVL
Domains	Evaluation, LLM_as_Judge, Multimodal
Last Updated	2026-02-07 14:00 GMT

Overview

GPT-Based Evaluation uses a large language model (typically GPT-4) as an automated judge to score and compare model-generated answers against reference responses on question-answering and visual understanding benchmarks.

Description

This principle describes the practice of using GPT-4 as an automated evaluator for open-ended model responses, where exact-match metrics are insufficient. The evaluation pipeline sends structured prompts to the GPT-4 API. Each prompt contains the original question, two answers to compare (one from the model under evaluation and one reference), and category-specific evaluation rules. GPT-4 returns a pair of numerical scores, one per answer, which is parsed from the first line of the response.
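The prompt assembly and first-line score parsing described above can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the function names `build_prompt` and `parse_score`, the bracketed section labels, the `role` and `prompt` rule fields, and the `[-1.0, -1.0]` fallback for unparseable reviews are all assumptions made for the example.

```python
def build_prompt(question, answer_1, answer_2, rule, context=""):
    """Assemble one judge prompt (layout is an assumption, not the repo's exact format)."""
    role = rule["role"]  # hypothetical rule field, e.g. "Assistant"
    return (
        f"[Context]\n{context}\n\n"
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer_1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer_2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )


def parse_score(review):
    """Read the two scores from the first line of the judge's reply.

    Accepts "8 7" or "8, 7". Returns [-1.0, -1.0] when the first line
    does not contain exactly two numbers (assumed failure convention).
    """
    first_line = review.strip().split("\n")[0]
    parts = first_line.replace(",", " ").split()
    if len(parts) != 2:
        return [-1.0, -1.0]
    try:
        return [float(parts[0]), float(parts[1])]
    except ValueError:
        return [-1.0, -1.0]
```

Returning a sentinel pair instead of raising keeps a long evaluation run alive when a single reply is malformed; such entries can then be filtered out during aggregation.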

The approach supports category-specific evaluation rules loaded from a JSON rule file, enabling different scoring criteria for different question types (e.g., conversation, detail, complex reasoning). Results are written to JSONL format for downstream aggregation.
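As a rough sketch of the rule lookup and JSONL output, assuming a rule file that maps each category name to an object with `role` and `prompt` fields plus a `default` entry (this schema, the helper names, and the record fields are hypothetical, not taken from the repository):

```python
import json


def load_rules(path):
    """Load the category -> rule mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def select_rule(rules, category, default="default"):
    """Pick the rule for a question category, falling back to a default entry."""
    if category in rules:
        return rules[category]
    return rules[default]


def append_result(path, record):
    """Append one evaluation record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one line per evaluated question means a partially completed run still leaves a valid file, which downstream aggregation can read line by line.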

Key design aspects include:

Retry logic with rate-limit handling, so that OpenAI API throttling does not abort a run
Resume capability by checking existing output files and skipping already-evaluated entries
Context enrichment with visual descriptions (captions, bounding boxes) for visual QA tasks
Parallel execution via Ray for large-scale evaluations
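The retry and resume aspects listed above can be illustrated with a dependency-free sketch. Everything named here is an assumption: `RateLimitError` is a stand-in for whatever exception the API client raises, exponential backoff is one common retry policy and not necessarily the one the repository uses, and `question_id` is a hypothetical record key. Ray-based parallelism is left out to keep the example self-contained.

```python
import json
import os
import time


class RateLimitError(Exception):
    """Stand-in for the API client's rate-limit exception (assumption)."""


def call_with_retry(request_fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return request_fn()
        except RateLimitError:
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_retries} rate-limited attempts")


def load_done_ids(output_path, id_key="question_id"):
    """Collect the ids already present in an existing JSONL output file."""
    done = set()
    if os.path.isfile(output_path):
        with open(output_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    done.add(json.loads(line)[id_key])
    return done


def pending(questions, done_ids, id_key="question_id"):
    """Return only the questions that have not been evaluated yet."""
    return [q for q in questions if q[id_key] not in done_ids]
```

A run would typically call `load_done_ids` once at startup, filter the question list with `pending`, and wrap each API request in `call_with_retry`. Passing `sleep` as a parameter makes the backoff easy to test without real delays.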

Usage

Use this principle when implementing automated evaluation of open-ended model responses on benchmarks like LLaVA-Bench, where human-like judgment is needed to assess answer quality beyond simple string matching.

Theoretical Basis

GPT-4 as a judge follows the "LLM-as-a-Judge" paradigm used in the LLaVA evaluation methodology. Strong LLMs have been shown to produce evaluations that correlate well with human judgments on open-ended QA tasks. Scoring a candidate and a reference answer in the same prompt yields a relative quality assessment, and the structured prompt format is intended to limit position bias.
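One way to turn the per-question score pairs into a relative quality figure is to express the candidate's mean score as a percentage of the reference's mean score. The sketch below assumes each pair is ordered as (reference score, candidate score) and that failed parses are marked with negative values; both conventions, and the function name `relative_score`, are assumptions for illustration rather than the repository's documented behavior.

```python
def relative_score(score_pairs):
    """Summarize (reference, candidate) score pairs.

    Pairs containing a negative value are treated as failed parses and skipped.
    Returns the two means and the candidate mean as a percentage of the
    reference mean.
    """
    valid = [(ref, cand) for ref, cand in score_pairs if ref >= 0 and cand >= 0]
    if not valid:
        raise ValueError("no valid score pairs to aggregate")
    ref_mean = sum(ref for ref, _ in valid) / len(valid)
    cand_mean = sum(cand for _, cand in valid) / len(valid)
    if ref_mean == 0:
        raise ValueError("reference mean is zero; relative score is undefined")
    return {
        "reference": ref_mean,
        "candidate": cand_mean,
        "relative": 100.0 * cand_mean / ref_mean,
    }
```

Reporting the ratio rather than the raw candidate mean makes results less sensitive to how generous the judge is overall, since both answers are scored in the same reply.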
