Implementation:OpenGVLab InternVL GPT Review Visual Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_as_Judge, Visual_QA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This script uses GPT-4 to evaluate model responses on visual question answering tasks by providing image captions and bounding box annotations as evaluation context.
Description
The eval_gpt_review_visual.py script implements GPT-4-based evaluation for visual content understanding tasks. It extends the standard review pipeline by enriching evaluation prompts with detailed visual context including image captions and object bounding box annotations (category and bbox coordinates).
For each question, the script:
- Loads the corresponding image context containing captions (joined with newlines) and instance annotations (formatted as "category: [bbox]")
- Constructs a prompt with context, question, both candidate answers, and category-specific evaluation rules
- Calls GPT-4 (model gpt-4-0314) to score the two answers, retrying when the API reports a rate limit
- Parses the score pair and writes results to JSONL
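The context and prompt steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the captions, instances, category, and bbox fields follow the description, while the rule keys role and prompt and the bracketed section labels are assumed for the example.

```python
def build_context(image_context: dict) -> str:
    """Join captions with newlines and list each instance as 'category: [bbox]'."""
    cap_str = "\n".join(image_context["captions"])
    box_str = "\n".join(
        f"{inst['category']}: {inst['bbox']}" for inst in image_context["instances"]
    )
    return f"{cap_str}\n\n{box_str}"


def build_prompt(context: str, question: str, answer1: str, answer2: str, rule: dict) -> str:
    """Assemble context, question, both answers, and the category rule into one prompt."""
    role = rule["role"]  # assumed rule key, e.g. "Assistant"
    return (
        f"[Context]\n{context}\n\n"
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"  # assumed rule key holding the evaluation instructions
    )
```

Because the judge only receives this text, the captions and bounding boxes are its sole description of the image.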
This variant can resume an interrupted run: it counts the reviews already present in the output file and skips that many entries before appending new ones. It also asserts that each question's category exists in the rule file. There is no default fallback, so the rule file must cover every category.
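A rough sketch of the resume and strict-category behaviour is shown below. The function names and the judge callable are hypothetical stand-ins introduced for illustration (judge replaces the GPT-4 call), and the record is trimmed to a few fields.

```python
import json
import os


def count_existing_reviews(output_path: str) -> int:
    """Return how many non-empty review lines the output file already holds."""
    if not os.path.isfile(output_path):
        return 0
    with open(output_path) as f:
        return sum(1 for line in f if line.strip())


def review_all(questions, rules, output_path, judge):
    """Append one review per question, skipping entries that were already written.

    `judge` is a hypothetical callable standing in for the GPT-4 request.
    """
    done = count_existing_reviews(output_path)
    with open(output_path, "a") as out:
        for idx, ques in enumerate(questions):
            category = ques["category"]
            # Strict coverage: fail instead of falling back to a default rule.
            assert category in rules, f"category not found in rule file: {category}"
            if idx < done:
                continue  # already reviewed in a previous run
            record = {
                "id": idx + 1,
                "category": category,
                "content": judge(ques, rules[category]),
            }
            out.write(json.dumps(record) + "\n")
            out.flush()  # keep partial progress on disk if the run is interrupted
```

Resuming by position assumes the question file has the same order on every run; reordering it between runs would misalign the skipped entries.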
Usage
Use this script to evaluate model outputs on visual QA tasks that assess spatial understanding and object recognition, where the GPT-4 judge needs detailed visual annotations as context.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/eval_gpt_review_visual.py
- Lines: 1-118
Signature
def get_eval(content: str, max_tokens: int) -> str: ...
def parse_score(review: str) -> list: ...
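The behaviour behind these two signatures might look roughly like the sketch below. It is illustrative only: the send callable and the RateLimited exception are hypothetical stand-ins for the API client, and the review format (two scores on the first line, separated by a space or comma, with [-1, -1] returned on failure) is an assumption rather than something this page confirms.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for the API client's rate-limit error."""


def get_eval_with_retry(send, content: str, max_tokens: int, delay: float = 0.5) -> str:
    """Call `send` until it succeeds, sleeping between rate-limited attempts."""
    while True:
        try:
            return send(content, max_tokens)
        except RateLimited:
            time.sleep(delay)


def parse_score(review: str) -> list:
    """Read two scores from the first line; return [-1, -1] if that fails."""
    first_line = review.split("\n")[0]
    parts = first_line.replace(",", " ").split()
    try:
        if len(parts) == 2:
            return [float(parts[0]), float(parts[1])]
    except ValueError:
        pass  # first line held two tokens that were not numbers
    return [-1, -1]
```

Returning a sentinel pair instead of raising keeps a long evaluation run going when a single review is malformed.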
Import
# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review_visual.py -q questions.jsonl -c context.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -q / --question | str (file path) | Yes | Path to JSONL file with questions (must have image and category fields) |
| -c / --context | str (file path) | Yes | Path to JSONL file with image contexts containing captions and instance bounding boxes |
| -a / --answer-list | list of str | Yes | Paths to two JSONL answer files (model and reference) |
| -r / --rule | str (file path) | Yes | Path to JSON file with evaluation rules (category keys must match exactly) |
| -o / --output | str (file path) | Yes | Path for the JSONL output review file (supports append/resume) |
| --max-tokens | int | No | Maximum tokens for GPT-4 output (default: 1024) |
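The table above maps naturally onto an argparse parser. The sketch below mirrors the listed flags and the 1024 default. Marking the options as required and fixing the answer list at exactly two paths are assumptions made here for clarity, and the script itself may be more permissive.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags from the Inputs table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="GPT-4 review for visual QA (sketch).")
    parser.add_argument("-q", "--question", required=True)
    parser.add_argument("-c", "--context", required=True)
    # Assumed: exactly two answer files, one per candidate.
    parser.add_argument("-a", "--answer-list", nargs=2, required=True)
    parser.add_argument("-r", "--rule", required=True)
    parser.add_argument("-o", "--output", required=True)
    parser.add_argument("--max-tokens", type=int, default=1024)
    return parser
```

argparse converts the dashed option names to attributes with underscores, so the parsed values are read as args.answer_list and args.max_tokens.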
Outputs
| Name | Type | Description |
|---|---|---|
| output file | JSONL | Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair) |
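Once the review file exists, the tuple field can be aggregated into a per-candidate average. The helper below is a hypothetical post-processing sketch and not part of the script. It assumes that negative scores mark reviews whose score pair could not be parsed and leaves them out of the mean.

```python
import json


def mean_scores(lines):
    """Average the two scores over review lines, ignoring failed parses."""
    totals, count = [0.0, 0.0], 0
    for line in lines:
        if not line.strip():
            continue
        s1, s2 = json.loads(line)["tuple"]
        if s1 < 0 or s2 < 0:  # assumed sentinel for unparseable reviews
            continue
        totals[0] += s1
        totals[1] += s2
        count += 1
    if count == 0:
        return None
    return [totals[0] / count, totals[1] / count]
```

Any iterable of JSONL lines works as input, so an open file handle on the review file can be passed directly.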
Usage Examples
Basic Usage
# Command-line execution for visual QA evaluation
python internvl_chat_llava/llava/eval/eval_gpt_review_visual.py \
    -q visual_qa_questions.jsonl \
    -c visual_context.jsonl \
    -a model_answers.jsonl reference_answers.jsonl \
    -r visual_rules.json \
    -o reviews_visual.jsonl