Implementation:OpenGVLab InternVL GPT Review Bench Evaluation

Knowledge Sources	OpenGVLab_InternVL
Domains	Evaluation, LLM_as_Judge, Benchmark
Last Updated	2026-02-07 14:00 GMT

Overview

This script uses GPT-4 as a judge on the LLaVA-Bench (In-the-Wild) benchmark: it scores a model's answers against reference answers, with image captions supplied as textual context for each image.

Description

The eval_gpt_review_bench.py script implements GPT-4-based evaluation tailored to the LLaVA-Bench benchmark. It loads a question file, two answer files, and a context file containing image captions. For each question, it builds a prompt from the image caption context, the question, both candidate answers, and the category-specific evaluation criteria, which are looked up in the rule file under keys prefixed with llava_bench_.
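The prompt assembly described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's actual code: the helper name build_prompt, the record fields (image, caption, text, category), the rule fields (role, prompt), and the bracketed section labels are all chosen for the example.

```python
def build_prompt(question, answer1, answer2, image_to_context, rules):
    """Assemble one judge prompt (hypothetical helper; field names are assumed)."""
    # Look up the caption context by the question's image filename.
    caption = image_to_context[question["image"]]["caption"]
    # Assumption: a caption may be a single string or a list of strings.
    if isinstance(caption, list):
        caption = "\n".join(caption)

    # Rule keys carry the llava_bench_ prefix.
    category = "llava_bench_" + question["category"]
    if category not in rules:
        raise KeyError(f"category not found in rule file: {category}")
    rule = rules[category]
    role = rule["role"]

    # Assumed layout: context, question, both answers, then the criteria.
    content = (
        f"[Context]\n{caption}\n\n"
        f"[Question]\n{question['text']}\n\n"
        f"[{role} 1]\n{answer1['text']}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2['text']}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )
    return category, content
```

Placing the captions first gives the text-only judge a description of the image before it reads the question and the two answers.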

Unlike the general review script, this variant:

Uses gpt-4-0314, a pinned model snapshot, so that evaluations are more reproducible across runs
Includes image caption context in the evaluation prompt to provide visual grounding
Can resume an interrupted run: it loads existing reviews from the output file, skips entries that were already evaluated, and appends new reviews to the same file
Processes evaluations sequentially rather than with Ray parallelism
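The resume behaviour in the list above can be sketched as below. This is a minimal sketch, not the script's code: it assumes reviews are appended in question order, so the number of lines already in the output file says how many entries to skip. The names run_with_resume and evaluate are hypothetical.

```python
import json
import os


def run_with_resume(items, output_path, evaluate):
    """Evaluate items in order, skipping those already present in output_path.

    `evaluate` is a hypothetical callable standing in for the GPT-4 call.
    Assumes earlier reviews were appended in the same order as `items`.
    """
    done = 0
    if os.path.isfile(output_path):
        with open(output_path) as f:
            done = sum(1 for line in f if line.strip())

    evaluated = 0
    # Append mode keeps reviews from earlier runs intact.
    with open(output_path, "a") as out:
        for idx, item in enumerate(items):
            if idx < done:
                continue  # already reviewed in a previous run
            record = {"id": idx + 1, "content": evaluate(item)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # persist each review so an interruption loses little
            evaluated += 1
    return evaluated
```

Because processing is sequential, a simple count of existing lines is enough to find the restart point. A parallel runner would need to track completed IDs instead.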

The get_eval function calls the OpenAI API with retry logic for rate limits, and parse_score extracts numerical score pairs from GPT-4 responses.
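The retry-on-rate-limit pattern mentioned for get_eval might look like the sketch below. To stay independent of any particular OpenAI client version, the API call is passed in as a callable. RateLimited, call_with_retry, the sleep interval, and the optional attempt bound are all assumptions made for this example, not names from the script.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client library's rate-limit error."""


def call_with_retry(request, sleep_seconds=0.5, max_attempts=None):
    """Call `request()` until it succeeds, pausing between attempts.

    `request` is a hypothetical zero-argument callable wrapping the API call.
    `max_attempts=None` retries indefinitely; the bound is an optional safeguard.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return request()
        except RateLimited:
            # Give up only if a bound was set and has been reached.
            if max_attempts is not None and attempt >= max_attempts:
                raise
        time.sleep(sleep_seconds)
```

In practice, request would wrap the chat-completion call for the pinned model and return the review text.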

Usage

Use this script to evaluate LLaVA model outputs on the LLaVA-Bench (In-the-Wild) benchmark, where image caption context is essential for accurate evaluation of visual understanding capabilities.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/eval/eval_gpt_review_bench.py
Lines: 1-121

Signature

def get_eval(content: str, max_tokens: int) -> str: ...

def parse_score(review: str) -> list: ...
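One way a function with the parse_score signature could work is sketched below. It assumes the judge is instructed to put the two scores on the first line of its reply, separated by a space or a comma. That format, the name parse_score_sketch, and the [-1, -1] failure value are assumptions for illustration.

```python
def parse_score_sketch(review: str) -> list:
    """Return [score1, score2] from the first line of a review, or [-1, -1]."""
    try:
        first_line = review.split("\n")[0]
        # Treat commas as separators so "8, 7" and "8 7" both parse.
        parts = first_line.replace(",", " ").split()
        if len(parts) == 2:
            return [float(parts[0]), float(parts[1])]
    except ValueError:
        pass  # non-numeric tokens fall through to the sentinel
    return [-1, -1]
```

Returning a sentinel instead of raising lets a long evaluation run continue past a malformed reply. Such entries can be filtered out when scores are aggregated.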

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review_bench.py -q questions.jsonl -c context.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl

I/O Contract

Inputs

Name	Type	Required	Description
-q / --question	str (file path)	Yes	Path to JSONL file with questions
-c / --context	str (file path)	Yes	Path to JSONL file with image captions keyed by image filename
-a / --answer-list	list of str	Yes	Paths to two JSONL answer files (model and reference)
-r / --rule	str (file path)	Yes	Path to JSON file with evaluation rules (keys prefixed with llava_bench_)
-o / --output	str (file path)	Yes	Path for the JSONL output review file (supports append/resume)
--max-tokens	int	No	Maximum tokens for GPT-4 output (default: 1024)
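The flags in the table map naturally onto an argparse parser. The sketch below mirrors the table only. Whether the script itself marks arguments as required or fixes the number of answer files at exactly two is an assumption here, and build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser matching the Inputs table (sketch, not the script's code)."""
    parser = argparse.ArgumentParser(description="GPT-4 review for LLaVA-Bench (sketch).")
    parser.add_argument("-q", "--question", required=True, help="questions JSONL")
    parser.add_argument("-c", "--context", required=True, help="image captions JSONL")
    # Assumption: exactly two answer files are accepted.
    parser.add_argument("-a", "--answer-list", nargs=2, required=True,
                        help="two answer JSONL files")
    parser.add_argument("-r", "--rule", required=True, help="evaluation rules JSON")
    parser.add_argument("-o", "--output", required=True, help="review output JSONL")
    parser.add_argument("--max-tokens", type=int, default=1024,
                        help="maximum tokens for the judge's reply")
    return parser
```

argparse exposes --answer-list as args.answer_list and --max-tokens as args.max_tokens.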

Outputs

Name	Type	Description
output file	JSONL	Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair)
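A downstream consumer of the review file could aggregate the score pairs per category, as in the sketch below. It relies only on the category and tuple fields listed above. The helper name summarize_reviews and the convention that negative scores mark unparseable reviews are assumptions.

```python
import json
from collections import defaultdict


def summarize_reviews(lines):
    """Average the score pair per category from JSONL review lines."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for line in lines:
        if not line.strip():
            continue
        review = json.loads(line)
        s1, s2 = review["tuple"]
        if s1 < 0 or s2 < 0:
            continue  # assumed sentinel for reviews whose scores failed to parse
        acc = sums[review["category"]]
        acc[0] += s1
        acc[1] += s2
        acc[2] += 1
    # Map each category to (mean score of answer 1, mean score of answer 2).
    return {cat: (a / n, b / n) for cat, (a, b, n) in sums.items()}
```

The function accepts any iterable of lines, so an open file handle on the output file can be passed directly.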

Usage Examples

Basic Usage

# Command-line execution for LLaVA-Bench evaluation
python internvl_chat_llava/llava/eval/eval_gpt_review_bench.py \
    -q llava_bench_questions.jsonl \
    -c llava_bench_context.jsonl \
    -a model_answers.jsonl reference_answers.jsonl \
    -r llava_bench_rules.json \
    -o reviews_bench.jsonl

Related Pages

Principle:OpenGVLab_InternVL_GPT_Based_Evaluation
