

Implementation:OpenGVLab InternVL GPT Review Bench Evaluation

From Leeroopedia


Knowledge Sources
Domains: Evaluation, LLM_as_Judge, Benchmark
Last Updated: 2026-02-07 14:00 GMT

Overview

This script uses GPT-4 to evaluate model responses on the LLaVA-Bench (In-the-Wild) benchmark by comparing model answers against reference answers with image caption context.

Description

The eval_gpt_review_bench.py script implements GPT-4-based evaluation tailored to the LLaVA-Bench benchmark. It loads a question file, two answer files, and a context file containing image captions. For each question, it constructs a prompt that includes the image caption context, the question, both candidate answers, and the category-specific evaluation criteria, which are looked up in the rule file under keys prefixed with llava_bench_.
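The prompt assembly described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name build_prompt, the section labels such as [Context], and the rule fields role and prompt are assumptions made for the example.

```python
def build_prompt(caption, question, answer1, answer2, rule):
    """Assemble a judge prompt from caption context, question, answers, and rule."""
    # Assumption: a caption entry may be a single string or a list of strings.
    cap_str = "\n".join(caption) if isinstance(caption, list) else caption
    role = rule["role"]  # hypothetical rule field naming the answerer role
    return (
        f"[Context]\n{cap_str}\n\n"
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )


# Hypothetical rule file content; real criteria live in the JSON passed via -r.
rules = {
    "llava_bench_conv": {
        "role": "Assistant",
        "prompt": "Rate both answers from 1 to 10.",
    }
}

# Rule keys carry the llava_bench_ prefix, so the question category is prefixed
# before the lookup.
category = "llava_bench_" + "conv"
prompt = build_prompt(
    ["A dog on a beach."], "What animal is shown?", "A dog.", "A cat.", rules[category]
)
print(prompt)
```

Placing the caption first gives the text-only judge a stand-in for the image before it reads the question and the two answers.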

Unlike the general review script, this variant:

  • Uses gpt-4-0314, a fixed GPT-4 snapshot, so that evaluations are more reproducible
  • Includes image caption context in the evaluation prompt to provide visual grounding
  • Supports resuming an interrupted run: it loads existing reviews from the output file, skips entries that were already evaluated, and appends new reviews to the same file
  • Processes evaluations sequentially rather than with Ray parallelism
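The resume behaviour in the list above could look something like the sketch below. It assumes reviews are written in question order, so the number of lines already in the output file tells the loop how many entries to skip. The names load_existing_reviews, review_all, and judge are hypothetical, and judge stands in for the GPT-4 call.

```python
import json
import os
import tempfile


def load_existing_reviews(path):
    """Return the reviews already stored in a JSONL file, or [] if it is missing."""
    if not os.path.isfile(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def review_all(items, output_path, judge):
    """Evaluate items sequentially, skipping those covered by earlier runs."""
    done = load_existing_reviews(output_path)
    new_count = 0
    with open(output_path, "a") as out:  # append so earlier reviews are kept
        for idx, item in enumerate(items):
            if idx < len(done):
                continue  # already evaluated in a previous run
            record = {"id": idx + 1, "content": judge(item)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # persist each review in case the run is interrupted
            new_count += 1
    return new_count


with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "reviews.jsonl")
    calls = []
    judge = lambda item: calls.append(item) or f"review of {item}"
    new_first = review_all(["q1", "q2"], out_path, judge)
    new_second = review_all(["q1", "q2", "q3"], out_path, judge)
    saved = load_existing_reviews(out_path)
```

In this sketch the second call only evaluates q3, because the first two reviews are already on disk. Position-based skipping is only safe if the input files keep the same order between runs.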

The get_eval function calls the OpenAI API with retry logic for rate limits, and parse_score extracts numerical score pairs from GPT-4 responses.
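As a rough illustration of retrying on rate limits, the sketch below wraps an arbitrary request callable rather than the OpenAI client itself. RateLimited, call_with_retry, and flaky_request are hypothetical names, and the sleep interval and optional attempt cap are assumptions, not values taken from the script.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the client library's rate-limit error."""


def call_with_retry(request, sleep_seconds=0.5, max_attempts=None):
    """Call request() until it succeeds, pausing after each rate-limit error."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return request()
        except RateLimited:
            # Give up only if the caller set an attempt cap (an assumption here).
            if max_attempts is not None and attempt >= max_attempts:
                raise
            time.sleep(sleep_seconds)


attempts = {"n": 0}


def flaky_request():
    """Fail twice with a rate-limit error, then return a review string."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited("slow down")
    return "8 7\nAssistant 1 was more detailed."


review = call_with_retry(flaky_request, sleep_seconds=0)
```

Keeping the retry loop separate from the API call makes it possible to exercise the logic without network access.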

Usage

Use this script to evaluate LLaVA model outputs on the LLaVA-Bench (In-the-Wild) benchmark, where image caption context is essential for accurate evaluation of visual understanding capabilities.

Code Reference

Source Location

Signature

def get_eval(content: str, max_tokens: int) -> str: ...

def parse_score(review: str) -> list: ...
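A score parser with this signature might be sketched as below. The sketch assumes the judge puts the two scores on the first line of its reply, separated by a space or a comma, and that a [-1, -1] sentinel marks an unparseable reply. Both are assumptions for the example rather than behaviour confirmed on this page, and the function is named parse_score_sketch to keep it distinct from the real one.

```python
def parse_score_sketch(review: str) -> list:
    """Extract a pair of scores from the first line of a review."""
    first_line = review.split("\n")[0]
    # Treat commas as separators so "8, 7" and "8 7" parse the same way.
    parts = first_line.replace(",", " ").split()
    if len(parts) != 2:
        return [-1, -1]  # assumed sentinel for an unparseable review
    try:
        return [float(parts[0]), float(parts[1])]
    except ValueError:
        return [-1, -1]


print(parse_score_sketch("8 7\nAssistant 1 gave more detail."))
print(parse_score_sketch("8, 7.5\nBoth answers were reasonable."))
print(parse_score_sketch("I cannot rate these answers."))
```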

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review_bench.py -q questions.jsonl -c context.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl

I/O Contract

Inputs

Name | Type | Required | Description
-q / --question | str (file path) | Yes | Path to JSONL file with questions
-c / --context | str (file path) | Yes | Path to JSONL file with image captions keyed by image filename
-a / --answer-list | list of str | Yes | Paths to two JSONL answer files (model and reference)
-r / --rule | str (file path) | Yes | Path to JSON file with evaluation rules (keys prefixed with llava_bench_)
-o / --output | str (file path) | Yes | Path for the JSONL output review file (supports append/resume)
--max-tokens | int | No | Maximum tokens for GPT-4 output (default: 1024)
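The flags in the table map naturally onto an argparse parser such as the one sketched here. Whether the script itself enforces the required flags with required=True is an assumption. The sketch only mirrors the table, and the file names in the example call are placeholders.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line inputs."""
    parser = argparse.ArgumentParser(description="GPT-4 review for LLaVA-Bench")
    parser.add_argument("-q", "--question", required=True)
    parser.add_argument("-c", "--context", required=True)
    # nargs="+" collects both answer files into a single list.
    parser.add_argument("-a", "--answer-list", nargs="+", required=True)
    parser.add_argument("-r", "--rule", required=True)
    parser.add_argument("-o", "--output", required=True)
    parser.add_argument("--max-tokens", type=int, default=1024)
    return parser


args = build_parser().parse_args([
    "-q", "questions.jsonl",
    "-c", "context.jsonl",
    "-a", "ans1.jsonl", "ans2.jsonl",
    "-r", "rules.json",
    "-o", "output.jsonl",
])
print(args.answer_list, args.max_tokens)
```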

Outputs

Name | Type | Description
output file | JSONL | Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair)
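To show how the output records might be consumed, here is a small sketch that averages the score pairs across a review file. The sample records are invented for illustration, and skipping negative scores assumes that a negative pair marks a review whose scores could not be parsed.

```python
import json


def mean_scores(lines):
    """Average the tuple field over JSONL review lines, skipping failed parses."""
    totals = [0.0, 0.0]
    count = 0
    for line in lines:
        record = json.loads(line)
        score1, score2 = record["tuple"]
        if score1 < 0 or score2 < 0:
            continue  # assumed sentinel for an unparseable review
        totals[0] += score1
        totals[1] += score2
        count += 1
    if count == 0:
        return None
    return [totals[0] / count, totals[1] / count]


# Invented sample records using the fields listed in the output table.
sample_lines = [
    json.dumps({"id": 1, "question_id": 0, "answer1_id": 0, "answer2_id": 0,
                "category": "llava_bench_conv", "content": "8 7\n...", "tuple": [8.0, 7.0]}),
    json.dumps({"id": 2, "question_id": 1, "answer1_id": 1, "answer2_id": 1,
                "category": "llava_bench_detail", "content": "6 9\n...", "tuple": [6.0, 9.0]}),
    json.dumps({"id": 3, "question_id": 2, "answer1_id": 2, "answer2_id": 2,
                "category": "llava_bench_complex", "content": "n/a", "tuple": [-1, -1]}),
]
averages = mean_scores(sample_lines)
print(averages)
```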

Usage Examples

Basic Usage

# Command-line execution for LLaVA-Bench evaluation
# python internvl_chat_llava/llava/eval/eval_gpt_review_bench.py \
#     -q llava_bench_questions.jsonl \
#     -c llava_bench_context.jsonl \
#     -a model_answers.jsonl reference_answers.jsonl \
#     -r llava_bench_rules.json \
#     -o reviews_bench.jsonl
