Implementation:OpenGVLab InternVL GPT Review Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, LLM_as_Judge, Distributed_Computing |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This script uses GPT-4 with Ray parallelization to evaluate model QA responses by comparing them against reference answers and producing numerical quality scores.
Description
The eval_gpt_review.py script implements a distributed GPT-4-based evaluation pipeline for general question answering tasks. It loads questions and two sets of answers (model and reference) from JSONL files, constructs structured evaluation prompts using category-specific rules from a JSON rule file, and dispatches GPT-4 API calls in parallel via Ray remote functions.
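The prompt-construction step described above can be sketched as follows. This is a minimal illustration, not the script's exact code: the helper name `build_prompt`, the JSON field names (`text`, `category`, `role`, `prompt`), the `default` fallback rule, and the bracketed section layout are all assumptions made for the example.

```python
import json


def build_prompt(question: dict, answer1: dict, answer2: dict, rules: dict) -> str:
    """Assemble one evaluation prompt from a question, two answers, and a rule set."""
    # Assumed rule layout: category -> {"role": ..., "prompt": ...},
    # with a "default" entry used when the category has no rule of its own.
    rule = rules.get(question.get("category"), rules["default"])
    role = rule["role"]
    return (
        f"[Question]\n{question['text']}\n\n"
        f"[{role} 1]\n{answer1['text']}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2['text']}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )


# Hypothetical sample records, shaped like one line from each input file.
rules = {"default": {"role": "Assistant", "prompt": "Rate both answers from 1 to 10."}}
question = json.loads('{"question_id": 1, "category": "generic", "text": "What is Ray?"}')
answer1 = {"answer_id": "a1", "text": "A distributed computing framework."}
answer2 = {"answer_id": "b1", "text": "A Python library for parallelism."}

prompt = build_prompt(question, answer1, answer2, rules)
print(prompt)
```

Keeping the role label and the criteria text in the rule file means a new category can be supported by editing JSON only, without touching the prompt-building code.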
Each evaluation prompt includes the original question, both candidate answers labeled by role (e.g., "Assistant"), and a system prompt specifying the evaluation criteria. The get_eval function is decorated with @ray.remote(num_cpus=4) to distribute API calls across Ray workers, with built-in retry logic for OpenAI rate limit errors.
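The retry behaviour can be illustrated with a small sketch that runs without Ray or an API key. Everything here is hypothetical: `RateLimitError` stands in for the API client's rate-limit exception, `call_with_retry` and `flaky_request` are invented for the example, and the attempt cap is a choice made for this sketch rather than something the source describes. In the script itself, logic of this kind would sit inside the function decorated with @ray.remote.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the API client's rate-limit exception."""


def call_with_retry(request, wait_seconds: float = 3.0, max_attempts: int = 5):
    """Call request() until it succeeds, sleeping after each rate-limit error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # give up instead of retrying forever
            time.sleep(wait_seconds)


# Simulated API call that is rate-limited twice before it succeeds.
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RateLimitError("rate limit reached")
    return "8 7\nBoth answers are accurate."


review = call_with_retry(flaky_request, wait_seconds=0.01)
print(attempts["count"], review.splitlines()[0])
```

Only the rate-limit error is retried in this sketch; any other exception propagates to the caller, which keeps genuine bugs visible.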
After all evaluations complete, the parse_score function extracts a pair of numerical scores from the first line of each GPT-4 response. Results are written to a JSONL output file containing the question ID, answer IDs, category, review content, and parsed score tuple.
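A minimal sketch of the score-parsing step, assuming the first line of a review holds two numbers separated by a space or a comma. The `[-1, -1]` sentinel for unparseable reviews and the sample record values are assumptions for illustration; the record's field names follow the output description above.

```python
import json


def parse_score(review: str) -> list:
    """Return [score1, score2] from the first line of a review, or [-1, -1]."""
    try:
        first_line = review.split("\n")[0]
        # Accept "8 7" as well as "8,7" or "8, 7".
        parts = first_line.replace(",", " ").split()
        if len(parts) == 2:
            return [float(parts[0]), float(parts[1])]
    except ValueError:
        pass
    # Assumed sentinel for a review whose first line is not two numbers.
    return [-1, -1]


review = "8 7\nAssistant 1 gave a more complete answer."
record = {
    "id": 1,
    "question_id": 1,
    "answer1_id": "a1",
    "answer2_id": "b1",
    "category": "generic",
    "content": review,
    "tuple": parse_score(review),
}
line = json.dumps(record)  # one line of the JSONL output file
print(line)
```

Returning a sentinel instead of raising lets a long batch finish even when a few reviews ignore the requested format; those rows can be filtered out afterwards.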
Usage
Use this script when model QA outputs need to be scored at scale with GPT-4 as an automated judge, particularly for large evaluation sets that benefit from Ray-based parallelism.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/eval_gpt_review.py
- Lines: 1-113
Signature
@ray.remote(num_cpus=4)
def get_eval(content: str, max_tokens: int) -> str: ...
def parse_score(review: str) -> list: ...
Import
# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review.py -q questions.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -q / --question | str (file path) | Yes | Path to JSONL file with questions |
| -a / --answer-list | list of str | Yes | Paths to two JSONL answer files (model and reference) |
| -r / --rule | str (file path) | Yes | Path to JSON file with category-specific evaluation rules |
| -o / --output | str (file path) | Yes | Path for the JSONL output review file |
| --max-tokens | int | No | Maximum tokens for GPT-4 output (default: 1024) |
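The inputs above map naturally onto an argparse parser. The sketch below is one way to express the table, not a copy of the script: `build_parser` is a hypothetical helper, and whether the script enforces required arguments or the two-file answer count through argparse is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the inputs table (illustrative only)."""
    parser = argparse.ArgumentParser(description="GPT-4 based QA review (sketch).")
    parser.add_argument("-q", "--question", required=True,
                        help="JSONL file with questions")
    # Exactly two answer files: model answers first, reference answers second.
    parser.add_argument("-a", "--answer-list", nargs=2, required=True,
                        metavar=("ANSWERS_1", "ANSWERS_2"),
                        help="two JSONL answer files")
    parser.add_argument("-r", "--rule", required=True,
                        help="JSON file with category-specific rules")
    parser.add_argument("-o", "--output", required=True,
                        help="JSONL file to write reviews to")
    parser.add_argument("--max-tokens", type=int, default=1024,
                        help="maximum number of tokens in the GPT-4 output")
    return parser


args = build_parser().parse_args([
    "-q", "questions.jsonl",
    "-a", "model_answers.jsonl", "reference_answers.jsonl",
    "-r", "evaluation_rules.json",
    "-o", "reviews.jsonl",
])
print(args.answer_list, args.max_tokens)
```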
Outputs
| Name | Type | Description |
|---|---|---|
| output file | JSONL | Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair) |
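As a rough illustration of consuming this output, the sketch below averages the two scores across reviews. The `summarize` helper and the sample rows are hypothetical, and treating negative scores as failed parses is an assumption rather than documented behaviour.

```python
import json
from statistics import mean


def summarize(lines):
    """Average the two scores over all reviews that parsed successfully."""
    pairs = [json.loads(line)["tuple"] for line in lines if line.strip()]
    # Assumption: a negative score marks a review that could not be parsed.
    valid = [pair for pair in pairs if min(pair) >= 0]
    if not valid:
        return None
    return [mean(p[0] for p in valid), mean(p[1] for p in valid)]


# Hypothetical rows standing in for lines read from the output file.
sample_lines = [
    json.dumps({"id": 1, "category": "generic", "tuple": [8.0, 7.0]}),
    json.dumps({"id": 2, "category": "generic", "tuple": [6.0, 9.0]}),
    json.dumps({"id": 3, "category": "generic", "tuple": [-1, -1]}),
]

averages = summarize(sample_lines)
print(averages)
```

With a real output file, the same function can be fed an open file handle, since iterating over a file yields one line at a time.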
Usage Examples
Basic Usage
# Command-line execution with Ray parallelism
# python internvl_chat_llava/llava/eval/eval_gpt_review.py \
# -q questions.jsonl \
# -a model_answers.jsonl reference_answers.jsonl \
# -r evaluation_rules.json \
# -o reviews.jsonl \
# --max-tokens 1024