Principle:OpenGVLab InternVL GPT Based Evaluation

From Leeroopedia


Knowledge Sources
Domains: Evaluation, LLM_as_Judge, Multimodal
Last Updated: 2026-02-07 14:00 GMT

Overview

GPT-Based Evaluation uses a large language model (typically GPT-4) as an automated judge to score and compare model-generated answers against reference responses on question-answering and visual understanding benchmarks.

Description

This principle describes the practice of using GPT-4 as an automated evaluator for open-ended model responses, where exact-match metrics are insufficient. The evaluation pipeline sends a structured prompt to the GPT-4 API containing the original question, two answers (one from the model under evaluation and one reference), and category-specific evaluation rules. GPT-4 returns a pair of numerical scores, one per answer, which is parsed from the first line of the response.
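The prompt assembly and first-line score parsing described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `build_prompt` and `parse_score`, the rule keys `role` and `prompt`, the bracketed prompt layout, and the `[-1.0, -1.0]` failure value are all assumptions made for the example.

```python
from typing import List


def build_prompt(question: str, answer_1: str, answer_2: str, rule: dict) -> str:
    """Assemble a judge prompt from the question, both answers, and one rule.

    The rule is assumed (hypothetically) to carry a 'role' label and a
    'prompt' string holding the scoring instructions.
    """
    role = rule["role"]
    return (
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer_1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer_2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )


def parse_score(review: str) -> List[float]:
    """Read the two scores from the first line of the judge's reply.

    Accepts space- or comma-separated numbers. Returns [-1.0, -1.0] when
    the first line does not contain exactly two numbers.
    """
    first_line = review.strip().split("\n")[0]
    parts = first_line.replace(",", " ").split()
    if len(parts) != 2:
        return [-1.0, -1.0]
    try:
        return [float(parts[0]), float(parts[1])]
    except ValueError:
        return [-1.0, -1.0]
```

Returning a sentinel pair instead of raising lets a long evaluation run continue past a malformed reply; failed entries can then be filtered out or re-queried later.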

The approach supports category-specific evaluation rules loaded from a JSON rule file, enabling different scoring criteria for different question types (e.g., conversation, detail, complex reasoning). Results are written to JSONL format for downstream aggregation.
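A minimal sketch of rule selection and JSONL output, assuming the rule file is a JSON object keyed by category name. The helper names, the `default` fallback key, and the record fields are hypothetical and only illustrate the shape of the data flow.

```python
import json
from typing import Dict, Iterable


def load_rules(path: str) -> Dict[str, dict]:
    """Load the category -> rule mapping from a JSON file."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def select_rule(rules: Dict[str, dict], category: str) -> dict:
    """Return the rule for a category, falling back to a 'default' entry.

    The 'default' key is an assumption of this sketch.
    """
    if category in rules:
        return rules[category]
    if "default" in rules:
        return rules["default"]
    raise KeyError(f"no rule for category {category!r} and no default rule")


def append_jsonl(path: str, records: Iterable[dict]) -> None:
    """Append each record to the file as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Appending one object per line means a partially completed run still leaves a valid file, which is what makes skipping already-evaluated entries straightforward.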

Key design aspects include:

  • Retry logic that handles OpenAI API rate limiting gracefully
  • Resume capability by checking existing output files and skipping already-evaluated entries
  • Context enrichment with visual descriptions (captions, bounding boxes) for visual QA tasks
  • Parallel execution via Ray for large-scale evaluations
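The retry and resume behaviors in the list above can be sketched with the standard library alone. This sketch makes several assumptions: `RateLimited` is a hypothetical stand-in for whatever rate-limit error the API client raises, exponential backoff is one possible waiting policy (the source does not say which one is used), and `question_id` is an assumed identifier field.

```python
import json
import os
import time
from typing import Callable, Set


class RateLimited(Exception):
    """Hypothetical stand-in for the API client's rate-limit error."""


def call_with_retry(
    request: Callable[[], str],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call request(), waiting and retrying when it is rate limited.

    The delay doubles after each failed attempt (an assumed policy).
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")


def load_done_ids(output_path: str, id_key: str = "question_id") -> Set:
    """Collect IDs already present in an existing JSONL output file."""
    done: Set = set()
    if not os.path.exists(output_path):
        return done
    with open(output_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                done.add(json.loads(line)[id_key])
    return done
```

A driver loop would call `load_done_ids` once at startup, skip any question whose ID is in the returned set, and wrap each API request in `call_with_retry`.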

Usage

Use this principle when implementing automated evaluation of open-ended model responses on benchmarks like LLaVA-Bench, where human-like judgment is needed to assess answer quality beyond simple string matching.

Theoretical Basis

Using GPT-4 as a judge follows the "LLM-as-a-Judge" paradigm used in the LLaVA evaluation methodology. Research has shown that strong LLMs can provide evaluations that correlate well with human judgments on open-ended QA tasks. The pairwise format, in which the judge scores both a candidate and a reference answer in the same prompt, enables relative quality assessment. LLM judges can be sensitive to the order in which answers are presented, so the structured prompt is intended to limit position bias, though it does not eliminate it.
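One way to turn the score pairs into a relative quality measure is to express the candidate's total score as a percentage of the reference's total, per category. The sketch below is an illustration under assumptions: the record fields `category` and `scores`, the `(reference, candidate)` ordering of each pair, and the convention that negative scores mark failed parses are all hypothetical choices, not confirmed details of the pipeline.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def relative_scores(records: Iterable[dict]) -> Dict[str, float]:
    """Return candidate score as a percentage of reference score per category.

    Each record is assumed to hold 'category' and 'scores', where scores
    is a (reference, candidate) pair. Pairs with a negative value are
    treated as failed parses and skipped.
    """
    totals: Dict[str, List[float]] = defaultdict(lambda: [0.0, 0.0])
    for rec in records:
        ref, cand = rec["scores"]
        if ref < 0 or cand < 0:
            continue
        totals[rec["category"]][0] += ref
        totals[rec["category"]][1] += cand
    return {
        cat: 100.0 * cand / ref
        for cat, (ref, cand) in totals.items()
        if ref > 0
    }
```

Normalizing by the reference score makes results comparable across categories whose absolute scores differ, since each category is measured against its own reference answers.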
