Implementation:OpenGVLab InternVL GPT Review Visual Evaluation

Knowledge Sources	OpenGVLab_InternVL
Domains	Evaluation, LLM_as_Judge, Visual_QA
Last Updated	2026-02-07 14:00 GMT

Overview

This script uses GPT-4 to evaluate model responses on visual question answering tasks by providing image captions and bounding box annotations as evaluation context.

Description

The eval_gpt_review_visual.py script implements GPT-4-based evaluation for visual content understanding tasks. It extends the standard review pipeline by enriching evaluation prompts with detailed visual context including image captions and object bounding box annotations (category and bbox coordinates).

For each question, the script:

1. Loads the corresponding image context, which contains captions (joined with newlines) and instance annotations (formatted as "category: [bbox]").
2. Constructs a prompt from the context, the question, both candidate answers, and the category-specific evaluation rules.
3. Calls GPT-4 (model gpt-4-0314) for scoring, retrying when a rate limit is hit.
4. Parses the score pair and writes the result to the JSONL output file.
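
The context formatting and prompt assembly described above can be sketched as follows. This is a minimal illustration, not the script's verbatim code: the field names captions and boxes, the rule keys role and prompt, and the bracketed section labels are assumptions made for this example, and format_context and build_prompt are hypothetical helper names.

```python
def format_context(inst: dict) -> str:
    """Render one image context as text for the judge.

    Assumes inst has a "captions" list of strings and a "boxes" list of
    dicts with "category" and "bbox" keys (assumed field names).
    """
    cap_str = "\n".join(inst["captions"])
    box_str = "\n".join(f'{b["category"]}: {b["bbox"]}' for b in inst["boxes"])
    return f"{cap_str}\n\n{box_str}"


def build_prompt(inst: dict, question: str, answer1: str, answer2: str, rule: dict) -> str:
    """Combine context, question, both answers, and the category rule.

    Assumes each rule is a dict with "role" and "prompt" keys.
    """
    role = rule["role"]
    return (
        f"[Context]\n{format_context(inst)}\n\n"
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )
```

Keeping the context first lets the judge read the captions and boxes before it sees either answer, which matters because the judge never receives the image itself.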

This variant can resume an interrupted run: it counts the reviews already present in the output file and skips that many entries before calling GPT-4 again. It also asserts that each question's category exists in the rule file. There is no default fallback, so a missing category stops the run instead of being scored under a generic rule.
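
A minimal sketch of the resume check and the strict category lookup, under the assumption that the output file holds one review per line in question order. The helper names count_existing_reviews, pending_indices, and lookup_rule are hypothetical and do not appear in the script.

```python
import json
import os


def count_existing_reviews(output_path: str) -> int:
    """Return how many reviews the output file already holds (0 if absent)."""
    path = os.path.expanduser(output_path)
    if not os.path.isfile(path):
        return 0
    with open(path) as f:
        return sum(1 for line in f if line.strip())


def pending_indices(num_questions: int, num_done: int) -> range:
    """Indices still to be reviewed, assuming reviews were written in order."""
    return range(min(num_done, num_questions), num_questions)


def lookup_rule(rule_dict: dict, category: str) -> dict:
    """Strict lookup: fail loudly instead of falling back to a default rule."""
    assert category in rule_dict, f"Category not found in rule file: {category}"
    return rule_dict[category]
```

Because resuming relies only on a line count, the question and answer files must keep the same order between runs, and the output file should be opened in append mode.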

Usage

Use this script to evaluate model outputs on visual QA tasks that assess spatial understanding and object recognition. The GPT-4 judge does not see the image, so it relies on the captions and bounding box annotations supplied as context.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/eval/eval_gpt_review_visual.py
Lines: 1-118

Signature

def get_eval(content: str, max_tokens: int) -> str: ...

def parse_score(review: str) -> list: ...
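
One plausible body for parse_score is sketched below. It assumes the judge writes the two scores on the first line of its review, separated by a space or a comma, and it returns [-1, -1] as a failure sentinel. Both the expected review layout and the sentinel value are assumptions for this example, not details confirmed by this page.

```python
def parse_score(review: str) -> list:
    """Extract the score pair from the first line of a review.

    Returns [-1, -1] (an assumed sentinel) when the first line does not
    contain exactly two numbers.
    """
    try:
        first_line = review.split("\n")[0]
        parts = first_line.replace(",", " ").split()
        if len(parts) == 2:
            return [float(parts[0]), float(parts[1])]
    except ValueError:
        pass
    return [-1, -1]
```

For instance, a review beginning with the line "8 9" parses to [8.0, 9.0], while a review that opens with prose yields the sentinel.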

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review_visual.py -q questions.jsonl -c context.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl

I/O Contract

Inputs

Name	Type	Required	Description
-q / --question	str (file path)	Yes	Path to JSONL file with questions (must have image and category fields)
-c / --context	str (file path)	Yes	Path to JSONL file with image contexts containing captions and instance bounding boxes
-a / --answer-list	list of str	Yes	Paths to two JSONL answer files (model and reference)
-r / --rule	str (file path)	Yes	Path to JSON file with evaluation rules (category keys must match exactly)
-o / --output	str (file path)	Yes	Path for the JSONL output review file (supports append/resume)
--max-tokens	int	No	Maximum tokens for GPT-4 output (default: 1024)

Outputs

Name	Type	Description
output file	JSONL	Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair)
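
As an illustration of how the output can be consumed, the sketch below averages the score pairs per category using the category and tuple fields listed above. It assumes tuple holds two numbers and that negative values mark a failed parse. The function name average_scores and the negative-sentinel convention are assumptions for this example.

```python
import json
from collections import defaultdict


def average_scores(lines):
    """Average (score1, score2) per category over JSONL review lines."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for line in lines:
        if not line.strip():
            continue
        rec = json.loads(line)
        s1, s2 = rec["tuple"]
        if s1 < 0 or s2 < 0:
            # Assumed convention: negative scores mean the review could not be parsed.
            continue
        t = totals[rec["category"]]
        t[0] += s1
        t[1] += s2
        t[2] += 1
    return {cat: (a / n, b / n) for cat, (a, b, n) in totals.items()}
```

A typical call would pass an open file handle, for example average_scores(open("reviews_visual.jsonl")).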

Usage Examples

Basic Usage

# Command-line execution for visual QA evaluation
# python internvl_chat_llava/llava/eval/eval_gpt_review_visual.py \
#     -q visual_qa_questions.jsonl \
#     -c visual_context.jsonl \
#     -a model_answers.jsonl reference_answers.jsonl \
#     -r visual_rules.json \
#     -o reviews_visual.jsonl

Related Pages

Principle:OpenGVLab_InternVL_GPT_Based_Evaluation
