Implementation:OpenGVLab InternVL GPT Review Evaluation

From Leeroopedia


Knowledge Sources
Domains: Evaluation, LLM_as_Judge, Distributed_Computing
Last Updated: 2026-02-07 14:00 GMT

Overview

This script uses GPT-4 as an automated judge, parallelized with Ray, to compare model QA responses against reference answers and assign each answer a numerical quality score.

Description

The eval_gpt_review.py script implements a distributed GPT-4-based evaluation pipeline for general question answering tasks. It loads questions and two sets of answers (model and reference) from JSONL files, constructs structured evaluation prompts using category-specific rules from a JSON rule file, and dispatches GPT-4 API calls in parallel via Ray remote functions.
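The loading and prompt-construction steps can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names (`parse_jsonl`, `pick_rule`, `build_prompt`), the record fields (`text`, `category`), the rule keys (`role`, `prompt`), the `default` fallback category, and the exact bracketed layout are all assumptions made for the example.

```python
import json


def parse_jsonl(lines):
    """Parse one JSON object per non-empty line (an open file works as `lines`)."""
    return [json.loads(line) for line in lines if line.strip()]


def pick_rule(rules, category):
    """Return the rule for a category, falling back to an assumed 'default' entry."""
    return rules.get(category, rules["default"])


def build_prompt(question, answer1, answer2, rule):
    """Assemble question, both answers (labeled by role), and the judging instructions."""
    role = rule["role"]  # e.g. "Assistant"
    return (
        f"[Question]\n{question}\n\n"
        f"[{role} 1]\n{answer1}\n\n[End of {role} 1]\n\n"
        f"[{role} 2]\n{answer2}\n\n[End of {role} 2]\n\n"
        f"[System]\n{rule['prompt']}\n\n"
    )
```

Keeping the rule lookup separate from prompt assembly makes it easy to add a category without touching the template.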

Each evaluation prompt includes the original question, both candidate answers labeled by role (e.g., "Assistant"), and a system prompt specifying the evaluation criteria. The get_eval function is decorated with @ray.remote(num_cpus=4) to distribute API calls across Ray workers, with built-in retry logic for OpenAI rate limit errors.
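The retry behavior described above can be illustrated in isolation. The sketch below is an assumption-laden stand-in: `RateLimitError` here is a local placeholder for the API client's rate-limit exception, and `call_with_retry`, `max_attempts`, and `wait_seconds` are hypothetical names; the real script's wait time and retry limit are not stated on this page. Ray and the OpenAI client are deliberately left out so the example runs without dependencies.

```python
import time


class RateLimitError(Exception):
    """Local stand-in for the API client's rate-limit exception (hypothetical)."""


def call_with_retry(request_fn, max_attempts=5, wait_seconds=1.0, sleep=time.sleep):
    """Call request_fn, retrying after a pause whenever it raises RateLimitError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except RateLimitError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(wait_seconds)  # back off before trying again
```

In the Ray setting, a loop like this would sit inside the remote function, so each worker handles its own rate-limit pauses while the driver only submits tasks with `.remote(...)` and collects results with `ray.get(...)`.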

After all evaluations complete, the parse_score function extracts a pair of numerical scores from the first line of each GPT-4 response. Results are written to a JSONL output file containing the question ID, answer IDs, category, review content, and parsed score tuple.
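A score parser of this kind might look like the sketch below. It assumes the two scores on the first line are separated by whitespace or a comma, and it returns `[-1, -1]` when parsing fails; both the accepted separators and the fallback value are assumptions for illustration, and the function is named `parse_score_sketch` to keep it distinct from the script's own `parse_score`.

```python
def parse_score_sketch(review):
    """Extract two numeric scores from the first line of a review string."""
    first_line = review.split("\n")[0]
    # Treat commas as separators so both "8 9" and "8, 9" are accepted.
    parts = first_line.replace(",", " ").split()
    if len(parts) != 2:
        return [-1, -1]  # assumed sentinel for an unparseable review
    try:
        return [float(parts[0]), float(parts[1])]
    except ValueError:
        return [-1, -1]
```

For example, `parse_score_sketch("8 9\nAssistant 1 was more concise...")` yields `[8.0, 9.0]`, while a review whose first line is free text yields the sentinel pair.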

Usage

Use this script to evaluate model QA outputs at scale using GPT-4 as an automated judge, particularly when processing large evaluation sets that benefit from Ray-based parallelism.

Code Reference

Source Location

Signature

@ray.remote(num_cpus=4)
def get_eval(content: str, max_tokens: int) -> str: ...

def parse_score(review: str) -> list: ...

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_gpt_review.py -q questions.jsonl -a ans1.jsonl ans2.jsonl -r rules.json -o output.jsonl

I/O Contract

Inputs

Name | Type | Required | Description
-q / --question | str (file path) | Yes | Path to JSONL file with questions
-a / --answer-list | list of str | Yes | Paths to two JSONL answer files (model and reference)
-r / --rule | str (file path) | Yes | Path to JSON file with category-specific evaluation rules
-o / --output | str (file path) | Yes | Path for the JSONL output review file
--max-tokens | int | No | Maximum tokens for GPT-4 output (default: 1024)
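The input contract above maps naturally onto `argparse`. The following is a sketch that mirrors the listed flags, not the script's actual parser: `build_parser` is a hypothetical helper, and using `nargs=2` to enforce exactly two answer files is an assumption (the page only says two paths are expected).

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="GPT-4 based QA evaluation (sketch)")
    parser.add_argument("-q", "--question", required=True,
                        help="JSONL file with questions")
    # Assumption: exactly two answer files, model first and reference second.
    parser.add_argument("-a", "--answer-list", nargs=2, required=True,
                        help="two JSONL answer files")
    parser.add_argument("-r", "--rule", required=True,
                        help="JSON file with category-specific rules")
    parser.add_argument("-o", "--output", required=True,
                        help="JSONL output file for reviews")
    parser.add_argument("--max-tokens", type=int, default=1024,
                        help="maximum tokens for the GPT-4 output")
    return parser
```

Note that `argparse` converts `--answer-list` and `--max-tokens` to the attribute names `answer_list` and `max_tokens` on the parsed namespace.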

Outputs

Name | Type | Description
output file | JSONL | Each line contains id, question_id, answer1_id, answer2_id, category, content (review text), and tuple (score pair)
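Downstream code can consume the output file one line at a time. The sketch below averages the two scores across records; `mean_scores` is a hypothetical helper, and the only field it relies on is `tuple`, as listed above. It makes no attempt to filter out reviews whose scores could not be parsed, which a real analysis would probably want to handle.

```python
import json


def mean_scores(lines):
    """Average the score pair stored under 'tuple' across JSONL review records."""
    totals = [0.0, 0.0]
    count = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        first, second = json.loads(line)["tuple"]
        totals[0] += first
        totals[1] += second
        count += 1
    if count == 0:
        return None  # no records to average
    return [totals[0] / count, totals[1] / count]
```

Passing an open file handle for the reviews file as `lines` streams the records without loading the whole file into memory.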

Usage Examples

Basic Usage

# Command-line execution with Ray parallelism
python internvl_chat_llava/llava/eval/eval_gpt_review.py \
    -q questions.jsonl \
    -a model_answers.jsonl reference_answers.jsonl \
    -r evaluation_rules.json \
    -o reviews.jsonl \
    --max-tokens 1024
