Implementation:OpenGVLab InternVL GPT Review Visual Evaluation

From Leeroopedia
Revision as of 16:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenGVLab_InternVL_GPT_Review_Visual_Evaluation.md)


Knowledge Sources
Domains: Evaluation, LLM_as_Judge, Visual_QA
Last Updated: 2026-02-07 14:00 GMT

Overview

This script uses GPT-4 as a judge to score pairs of model responses on visual question answering tasks, supplying image captions and object bounding-box annotations as textual context for the evaluation.

Description

The eval_gpt_review_visual.py script implements GPT-4-based evaluation for visual content understanding tasks. It extends the standard review pipeline by enriching evaluation prompts with detailed visual context including image captions and object bounding box annotations (category and bbox coordinates).

For each question, the script:

  1. Loads the corresponding image context containing captions (joined with newlines) and instance annotations (formatted as "category: [bbox]")
  2. Constructs a prompt with context, question, both candidate answers, and category-specific evaluation rules
  3. Calls the gpt-4-0314 model for scoring, retrying when a request is rate-limited
  4. Parses the score pair and writes results to JSONL
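
Steps 1 and 2 above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names, the bracketed section labels, and the "role" and "prompt" keys of a rule entry are assumptions made for the example. Only the caption joining and the "category: [bbox]" formatting come from the description above.

```python
def format_context(image_context: dict) -> str:
    """Render captions and instance annotations as plain text for the judge."""
    # Captions are joined with newlines, as described in step 1.
    captions = "\n".join(image_context["captions"])
    # Each instance becomes one "category: [bbox]" line.
    boxes = "\n".join(
        f'{inst["category"]}: {inst["bbox"]}' for inst in image_context["instances"]
    )
    return f"{captions}\n\n{boxes}"


def build_prompt(image_context: dict, question: str, answer1: str,
                 answer2: str, rule: dict) -> str:
    """Assemble context, question, both answers, and the category rule."""
    role = rule["role"]  # assumed key: label used for the two candidates
    return (
        f"[Context]\n{format_context(image_context)}\n\n"
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer1}\n\n"
        f"[{role} 2]\n{answer2}\n\n"
        f"[System]\n{rule['prompt']}\n"  # assumed key: judging instructions
    )
```

Keeping the context formatter separate from the prompt builder makes it easy to inspect exactly what the judge sees for a given image.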

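This variant can resume an interrupted run: it counts the entries already present in the output file and skips that many questions from the start of the input before appending new reviews. It also asserts that each question's category exists in the rule file. There is no default fallback rule, so a missing category stops the run instead of being scored under a generic prompt.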
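
The resume and strict-category behaviour described above might look like the sketch below. All names here (count_existing_reviews, select_rule, review_with_resume, review_fn) are hypothetical, and the record is reduced to a few fields for brevity.

```python
import json
import os


def count_existing_reviews(output_path: str) -> int:
    """Number of reviews already written; 0 if the file does not exist yet."""
    if not os.path.isfile(output_path):
        return 0
    with open(output_path) as f:
        return sum(1 for line in f if line.strip())


def select_rule(rules: dict, category: str) -> dict:
    """Strict lookup: fail instead of falling back to a default rule."""
    if category not in rules:
        raise AssertionError(f"Category not found in rule file: {category}")
    return rules[category]


def review_with_resume(items, rules, output_path, review_fn) -> int:
    """Append reviews for items not yet covered by the output file."""
    done = count_existing_reviews(output_path)
    written = 0
    with open(output_path, "a") as out:
        for idx, item in enumerate(items):
            rule = select_rule(rules, item["category"])
            if idx < done:
                continue  # already reviewed in an earlier run
            record = {"id": idx + 1, "category": item["category"],
                      "content": review_fn(item, rule)}
            out.write(json.dumps(record) + "\n")
            out.flush()  # keep partial progress if the run is interrupted
            written += 1
    return written
```

Because resumption relies only on the number of lines in the output file, it assumes the input files are passed in the same order on every run.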

Usage

Use this script to evaluate model outputs on visual QA tasks that assess spatial understanding and object recognition. Because the GPT-4 judge relies on the supplied context, each image needs detailed annotations (captions and bounding boxes).

Code Reference

Source Location

Signature

def get_eval(content: str, max_tokens: int) -> str: ...

def parse_score(review: str) -> list: ...
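
To illustrate what these two functions are responsible for, here is a hedged sketch of a retry loop and a score parser. It does not reproduce the script's code: call_with_retry, RateLimited, and parse_score_sketch are hypothetical names, the API call is replaced by an arbitrary callable, and both the "two numbers on the first line" review format and the [-1, -1] failure value are assumptions.

```python
import time


class RateLimited(Exception):
    """Hypothetical stand-in for a client library's rate-limit error."""


def call_with_retry(request_fn, retryable=(RateLimited,), delay_seconds=0.5,
                    max_attempts=None, sleep=time.sleep):
    """Call request_fn until it succeeds, pausing after retryable errors."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return request_fn()
        except retryable:
            if max_attempts is not None and attempt >= max_attempts:
                raise  # give up and surface the last error
            sleep(delay_seconds)


def parse_score_sketch(review: str) -> list:
    """Read two scores from the first line; [-1, -1] marks a parse failure."""
    first_line = review.split("\n")[0]
    parts = first_line.replace(",", " ").split()
    if len(parts) != 2:
        return [-1, -1]
    try:
        return [float(parts[0]), float(parts[1])]
    except ValueError:
        return [-1, -1]
```

Returning a sentinel pair instead of raising lets a long run continue past a malformed review, at the cost of having to filter those entries out later.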

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review_visual.py -q questions.jsonl -c context.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl

I/O Contract

Inputs

Name | Type | Required | Description
-q / --question | str (file path) | Yes | Path to JSONL file with questions (must have image and category fields)
-c / --context | str (file path) | Yes | Path to JSONL file with image contexts containing captions and instance bounding boxes
-a / --answer-list | list of str | Yes | Paths to two JSONL answer files (model and reference)
-r / --rule | str (file path) | Yes | Path to JSON file with evaluation rules (category keys must match exactly)
-o / --output | str (file path) | Yes | Path for the JSONL output review file (supports append/resume)
--max-tokens | int | No | Maximum tokens for GPT-4 output (default: 1024)
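
The inputs listed above map naturally onto an argparse parser. The sketch below is reconstructed from the table, not copied from the script; in particular, marking the options as required and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the inputs listed above."""
    parser = argparse.ArgumentParser(description="GPT-4 review for visual QA.")
    parser.add_argument("-q", "--question", required=True,
                        help="JSONL file with questions")
    parser.add_argument("-c", "--context", required=True,
                        help="JSONL file with captions and bounding boxes")
    # nargs="+" collects the answer file paths into a list.
    parser.add_argument("-a", "--answer-list", nargs="+", required=True,
                        help="two JSONL answer files")
    parser.add_argument("-r", "--rule", required=True,
                        help="JSON file with per-category rules")
    parser.add_argument("-o", "--output", required=True,
                        help="JSONL review file (appended to on resume)")
    parser.add_argument("--max-tokens", type=int, default=1024,
                        help="maximum tokens in the judge's reply")
    return parser
```

Note that argparse exposes --answer-list and --max-tokens as the attributes answer_list and max_tokens.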

Outputs

Name | Type | Description
output file | JSONL | Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair)
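
As an example of consuming this output, the sketch below averages the score pairs per category. The helper summarize_reviews is hypothetical and not part of the script; treating negative scores as failed parses to be skipped is also an assumption.

```python
import json


def summarize_reviews(lines) -> dict:
    """Average (answer 1, answer 2) scores per category from JSONL lines."""
    totals = {}
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        score1, score2 = record["tuple"]
        if score1 < 0 or score2 < 0:
            continue  # assumed sentinel for a review that could not be parsed
        bucket = totals.setdefault(record["category"], [0.0, 0.0, 0])
        bucket[0] += score1
        bucket[1] += score2
        bucket[2] += 1
    return {cat: (t[0] / t[2], t[1] / t[2]) for cat, t in totals.items()}
```

Any iterable of lines works as input, for instance an open file handle on the review file.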

Usage Examples

Basic Usage

# Command-line execution for visual QA evaluation
# python internvl_chat_llava/llava/eval/eval_gpt_review_visual.py \
#     -q visual_qa_questions.jsonl \
#     -c visual_context.jsonl \
#     -a model_answers.jsonl reference_answers.jsonl \
#     -r visual_rules.json \
#     -o reviews_visual.jsonl
