Implementation:OpenGVLab InternVL GPT Review Bench Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_as_Judge, Benchmark |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This script uses GPT-4 as a judge for the LLaVA-Bench (In-the-Wild) benchmark: it compares a model's answers with reference answers, supplying image captions as textual context for each comparison.
Description
The eval_gpt_review_bench.py script implements GPT-4-based evaluation tailored to the LLaVA-Bench benchmark. It loads a question file, two answer files, and a context file containing image captions. For each question, it constructs a prompt containing the image caption context, the question, both candidate answers, and the category-specific evaluation criteria, which are looked up in the rule file under keys prefixed with llava_bench_.
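The prompt assembly described above can be sketched roughly as follows. This is a minimal illustration, not the script's exact code: the section labels (`[Context]`, `[Question]`, `[System]`), the rule fields `role` and `prompt`, and the helper names `build_prompt` and `lookup_rule` are assumptions made for the example.

```python
def lookup_rule(rule_dict, category):
    """Fetch the rule for a question category using the llava_bench_ key prefix."""
    key = "llava_bench_" + category
    if key not in rule_dict:
        raise KeyError(f"category not found in rule file: {key}")
    return rule_dict[key]


def build_prompt(caption, question, answer1, answer2, rule):
    """Assemble one judge prompt from caption context, question, two answers, and a rule."""
    # Assumption: a caption may be a single string or a list of strings.
    cap_str = "\n".join(caption) if isinstance(caption, list) else caption
    role = rule["role"]  # assumed rule field, e.g. "Assistant"
    return (
        f"[Context]\n{cap_str}\n\n"
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )


# Hypothetical rule file content and inputs, for illustration only.
rules = {"llava_bench_conv": {"role": "Assistant", "prompt": "Rate both answers from 1 to 10."}}
prompt = build_prompt(
    caption=["A dog on a beach.", "Waves in the background."],
    question="What is the animal doing?",
    answer1="It is running.",
    answer2="It is sitting on the sand.",
    rule=lookup_rule(rules, "conv"),
)
print(prompt)
```

Raising an error for an unknown category, rather than falling back to a default rule, keeps a mismatch between the question file and the rule file visible.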
Unlike the general review script, this variant:
- Uses gpt-4-0314, a pinned model snapshot, so that results are not affected by later model updates
- Includes image caption context in the evaluation prompt, giving the text-only judge a description of each image
- Can resume an interrupted run: it loads any reviews already in the output file, skips those entries, and appends new reviews to the same file
- Processes evaluations sequentially rather than with Ray parallelism
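The resume behaviour in the list above could look roughly like the sketch below. It assumes that already-evaluated entries are identified by their position in the output file (the first N lines correspond to the first N questions); `load_existing_reviews`, `run_with_resume`, and `fake_judge` are hypothetical names used only for this illustration.

```python
import json
import os
import tempfile


def load_existing_reviews(path):
    """Return reviews already written to the output file, or an empty list."""
    if os.path.isfile(path):
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    return []


def run_with_resume(items, output_path, evaluate):
    """Evaluate items in order, skipping those already present in output_path."""
    done = len(load_existing_reviews(output_path))
    # Append mode preserves reviews from earlier runs.
    with open(output_path, "a") as out:
        for idx, item in enumerate(items):
            if idx < done:
                continue  # assumed: skip by position, not by question id
            record = {"id": idx + 1, "content": evaluate(item)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # keep the file usable if the run is interrupted
    return done


calls = []


def fake_judge(question):
    """Stand-in for the GPT-4 call so the sketch runs offline."""
    calls.append(question)
    return f"review of {question}"


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "reviews.jsonl")
    run_with_resume(["q1", "q2"], out_path, fake_judge)  # simulated partial run
    skipped = run_with_resume(["q1", "q2", "q3"], out_path, fake_judge)
    final_ids = [r["id"] for r in load_existing_reviews(out_path)]
```

Under this positional assumption, the question and answer files must keep the same order between runs; otherwise skipped entries would no longer line up with the stored reviews.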
The get_eval function calls the OpenAI API and retries when it hits a rate limit; parse_score extracts the pair of numerical scores from each GPT-4 response.
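As a rough sketch of these two helpers, the example below retries a request on a rate-limit error and parses a score pair from a review. Several details are assumptions rather than facts from the script: that the two scores appear on the first line of the review separated by a space or comma, that a failed parse yields `[-1, -1]`, and the names `RateLimited` and `call_with_retry`, which stand in for the API client's error type and the retry loop.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the API client's rate-limit error."""


def call_with_retry(request, sleep_seconds=0.5):
    """Call request() until it succeeds, sleeping between rate-limited attempts."""
    while True:
        try:
            return request()
        except RateLimited:
            time.sleep(sleep_seconds)


def parse_score(review):
    """Read two numbers from the first line of a review; [-1, -1] signals failure."""
    try:
        first_line = review.split("\n")[0]
        parts = first_line.replace(",", " ").split()
        if len(parts) == 2:
            return [float(parts[0]), float(parts[1])]
    except ValueError:
        pass
    return [-1, -1]


attempts = {"n": 0}


def flaky_request():
    """Simulated API call that is rate limited twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "8 9\nAssistant 2 gives more detail."


review = call_with_retry(flaky_request, sleep_seconds=0.0)
print(parse_score(review))
```

Returning a sentinel pair instead of raising lets a long sequential run continue past a malformed review, at the cost of requiring downstream code to filter those entries.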
Usage
Use this script to evaluate LLaVA model outputs on the LLaVA-Bench (In-the-Wild) benchmark, where the judge needs image caption context to assess how well an answer reflects the image content.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/eval_gpt_review_bench.py
- Lines: 1-121
Signature
def get_eval(content: str, max_tokens: int) -> str: ...
def parse_score(review: str) -> list: ...
Import
# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review_bench.py -q questions.jsonl -c context.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -q / --question | str (file path) | Yes | Path to JSONL file with questions |
| -c / --context | str (file path) | Yes | Path to JSONL file with image captions keyed by image filename |
| -a / --answer-list | list of str | Yes | Paths to two JSONL answer files (model and reference) |
| -r / --rule | str (file path) | Yes | Path to JSON file with evaluation rules (keys prefixed with llava_bench_) |
| -o / --output | str (file path) | Yes | Path for the JSONL output review file (supports append/resume) |
| --max-tokens | int | No | Maximum tokens for GPT-4 output (default: 1024) |
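The inputs in the table above map naturally onto an argparse parser. The following is a minimal sketch based only on the flags listed in the table; marking flags as `required` and fixing `--answer-list` to exactly two paths are choices made for this example and may differ from how the script itself declares them.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line inputs."""
    parser = argparse.ArgumentParser(description="GPT-4 review on LLaVA-Bench (sketch).")
    parser.add_argument("-q", "--question", required=True, help="questions JSONL")
    parser.add_argument("-c", "--context", required=True, help="image captions JSONL")
    # Sketch choice: enforce exactly two answer files.
    parser.add_argument("-a", "--answer-list", nargs=2, required=True, help="two answer JSONL files")
    parser.add_argument("-r", "--rule", required=True, help="evaluation rules JSON")
    parser.add_argument("-o", "--output", required=True, help="review output JSONL")
    parser.add_argument("--max-tokens", type=int, default=1024, help="max tokens for the judge output")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args([
    "-q", "questions.jsonl",
    "-c", "context.jsonl",
    "-a", "ans1.jsonl", "ans2.jsonl",
    "-r", "rules.json",
    "-o", "output.jsonl",
])
print(args.answer_list, args.max_tokens)
```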
Outputs
| Name | Type | Description |
|---|---|---|
| output file | JSONL | Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair) |
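Once the review file exists, the per-line records can be aggregated. The sketch below averages the score pairs per category using the `category` and `tuple` fields listed above; the sample records are invented for illustration, and skipping negative scores assumes that a failed parse is stored as a negative sentinel pair.

```python
import json


def summarize_reviews(lines):
    """Average the (answer 1, answer 2) scores per category from JSONL review lines."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        score1, score2 = record["tuple"]
        if score1 < 0 or score2 < 0:
            continue  # assumed sentinel for reviews whose scores could not be parsed
        bucket = totals.setdefault(record["category"], [0.0, 0.0, 0])
        bucket[0] += score1
        bucket[1] += score2
        bucket[2] += 1
    return {cat: (s1 / n, s2 / n) for cat, (s1, s2, n) in totals.items()}


# Invented sample records with the documented fields.
sample_lines = [
    json.dumps({"id": 1, "question_id": 0, "answer1_id": 0, "answer2_id": 0,
                "category": "llava_bench_conv", "content": "8 6\n...", "tuple": [8.0, 6.0]}),
    json.dumps({"id": 2, "question_id": 1, "answer1_id": 1, "answer2_id": 1,
                "category": "llava_bench_conv", "content": "6 8\n...", "tuple": [6.0, 8.0]}),
    json.dumps({"id": 3, "question_id": 2, "answer1_id": 2, "answer2_id": 2,
                "category": "llava_bench_detail", "content": "unclear", "tuple": [-1, -1]}),
]
summary = summarize_reviews(sample_lines)
print(summary)
```

To summarize a real run, pass an open file handle for the output JSONL in place of `sample_lines`.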
Usage Examples
Basic Usage
# Command-line execution for LLaVA-Bench evaluation
# python internvl_chat_llava/llava/eval/eval_gpt_review_bench.py \
# -q llava_bench_questions.jsonl \
# -c llava_bench_context.jsonl \
# -a model_answers.jsonl reference_answers.jsonl \
# -r llava_bench_rules.json \
# -o reviews_bench.jsonl