Implementation:OpenGVLab InternVL ScienceQA Evaluation

Knowledge Sources	OpenGVLab_InternVL
Domains	Evaluation, Benchmark, Multiple_Choice
Last Updated	2026-02-07 14:00 GMT

Overview

This script evaluates model predictions on the ScienceQA multiple-choice benchmark by parsing answer options from model text and computing overall and image-specific accuracy.

Description

The eval_science_qa.py script implements the evaluation pipeline for the ScienceQA benchmark. It loads problem definitions from a JSON file, prediction results from a JSONL file, and split indices to filter for the target evaluation split (default: test).
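The loading step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper name load_split_inputs is hypothetical, and the sketch assumes that pid_splits.json maps split names to lists of problem IDs, that problems.json maps problem IDs to problem records, and that each JSONL line carries a question_id field.

```python
import json
import os


def load_split_inputs(base_dir, result_file, split="test"):
    """Hypothetical helper: load problems and predictions for one split."""
    # Assumed layout: {"train": [...], "test": [...], ...}
    with open(os.path.join(base_dir, "pid_splits.json")) as f:
        split_ids = json.load(f)[split]

    # Assumed layout: {"<problem id>": {...problem fields...}, ...}
    with open(os.path.join(base_dir, "problems.json")) as f:
        problems = json.load(f)

    # One JSON object per non-empty line, keyed here by question_id.
    with open(result_file) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    predictions = {row["question_id"]: row for row in rows}

    # Keep only the problems that belong to the requested split.
    split_problems = {pid: problems[pid] for pid in split_ids}
    return split_problems, predictions
```

Keying predictions by question ID makes it easy to detect problems in the split that have no prediction at all.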

Answer extraction tries the following strategies in order and stops at the first one that succeeds:

Exact match on a bare option letter (e.g., "A")
Option letter followed by a period and a space at the start of the text (e.g., "A. ")
Regex extraction of the "The answer is X." pattern
Fallback to "FAILED" if no pattern matches
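The cascade of parsing strategies can be sketched as below. The function name parse_answer, the exact regular expression, and the rule that the regex must match exactly once are assumptions made for this illustration; the script's own code may differ in these details.

```python
import re

OPTIONS = ["A", "B", "C", "D", "E"]

# Assumed pattern: a single capital letter followed by a literal period.
ANSWER_PATTERN = re.compile(r"The answer is ([A-Z])\.")


def parse_answer(pred_text, options=OPTIONS):
    """Hypothetical helper: reduce free-form model text to an option letter."""
    # Strategy 1: the whole prediction is a bare option letter.
    if pred_text in options:
        return pred_text
    # Strategy 2: the text starts with "<letter>. ".
    if len(pred_text) >= 3 and pred_text[0] in options and pred_text[1:3] == ". ":
        return pred_text[0]
    # Strategy 3: "The answer is X." appears exactly once (assumption).
    matches = ANSWER_PATTERN.findall(pred_text)
    if len(matches) == 1:
        return matches[0]
    # Strategy 4: nothing matched.
    return "FAILED"


print(parse_answer("B"))                             # B
print(parse_answer("C. Because plants need light"))  # C
print(parse_answer("I think so. The answer is D."))  # D
print(parse_answer("It depends"))                    # FAILED
```

Returning a sentinel string instead of raising keeps unparseable predictions in the tally, where they count as incorrect.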

The script computes:

Overall accuracy across all questions
Image-specific accuracy (IMG-Accuracy) for questions containing the <image> token, isolating multimodal reasoning performance
Per-question analysis with parsed answer, ground truth, prediction text, and multimodal flag
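A minimal sketch of the two accuracy figures, assuming each per-question record carries the prompt text and a boolean correctness flag. The record shape, the helper names, the percentage scaling, and the zero-division guard are choices of this sketch rather than confirmed details of the script.

```python
def is_multimodal(prompt):
    """A question counts as multimodal if its prompt contains the <image> token."""
    return "<image>" in prompt


def accuracy_summary(records):
    """Hypothetical helper: records are dicts with 'prompt' (str) and 'correct' (bool)."""
    def pct(numerator, denominator):
        # Guard added for this sketch so an empty subset does not raise.
        return 100.0 * numerator / denominator if denominator else 0.0

    total = len(records)
    correct = sum(1 for r in records if r["correct"])

    # Restrict to image-bearing questions for IMG-Accuracy.
    img_records = [r for r in records if is_multimodal(r["prompt"])]
    img_correct = sum(1 for r in img_records if r["correct"])

    return {
        "total": total,
        "correct": correct,
        "acc": pct(correct, total),
        "img_total": len(img_records),
        "img_acc": pct(img_correct, len(img_records)),
    }
```

Reporting IMG-Accuracy separately matters because text-only questions can inflate the overall score of a model whose visual reasoning is weak.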

Results are saved to two JSON files: a detailed analysis with correct/incorrect breakdowns and a summary with SQA-format results compatible with the official leaderboard.

Usage

Use this script to evaluate model outputs on ScienceQA after generating predictions with the model_vqa_science.py inference script. It handles the answer formats that LLaVA models commonly produce: a bare option letter, a letter followed by a period, and a sentence of the form "The answer is X."

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/eval/eval_science_qa.py
Lines: 1-114

Signature

def get_args() -> argparse.Namespace: ...

def convert_caps(results: list) -> list: ...

def get_pred_idx(prediction: str, choices: list, options: list) -> int: ...
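One plausible reading of the get_pred_idx signature is a mapping from a parsed option letter to its zero-based index, restricted to the letters that are valid for the question's number of choices. The sketch below uses a separate name (get_pred_idx_sketch) to make clear it is illustrative; returning -1 for an invalid letter is an assumption.

```python
def get_pred_idx_sketch(prediction, choices, options):
    """Illustrative only: map an option letter such as 'C' to an index such as 2."""
    # Only the first len(choices) letters are valid for this question.
    valid = options[:len(choices)]
    if prediction in valid:
        return options.index(prediction)
    # Assumed sentinel for unparseable or out-of-range answers.
    return -1


letters = ["A", "B", "C", "D", "E"]
print(get_pred_idx_sketch("C", ["x", "y", "z"], letters))       # 2
print(get_pred_idx_sketch("D", ["x", "y", "z"], letters))       # -1
print(get_pred_idx_sketch("FAILED", ["x", "y", "z"], letters))  # -1
```

A sentinel that can never equal a ground-truth index guarantees that failed parses are scored as wrong.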

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_science_qa.py --base-dir /path/to/sqa --result-file predictions.jsonl --output-file analysis.json --output-result sqa_results.json

I/O Contract

Inputs

Name	Type	Required	Description
--base-dir	str (dir path)	Yes	ScienceQA base directory containing problems.json and pid_splits.json
--result-file	str (file path)	Yes	Path to JSONL file with model predictions
--output-file	str (file path)	Yes	Path for detailed JSON analysis output (correct/incorrect breakdowns)
--output-result	str (file path)	Yes	Path for SQA-format JSON results (acc, count, per-question results)
--split	str	No	Data split to evaluate (default: "test")
--options	list	No	Valid option letters (default: ["A", "B", "C", "D", "E"])
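A parser matching the table above might look like the following sketch. The function name build_parser is hypothetical, and both required=True and nargs="+" are choices made here for illustration; how the script itself declares these arguments is not confirmed by this page.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented command-line inputs."""
    parser = argparse.ArgumentParser(description="ScienceQA evaluation (sketch)")
    parser.add_argument("--base-dir", type=str, required=True,
                        help="directory with problems.json and pid_splits.json")
    parser.add_argument("--result-file", type=str, required=True,
                        help="JSONL file with model predictions")
    parser.add_argument("--output-file", type=str, required=True,
                        help="path for the detailed analysis JSON")
    parser.add_argument("--output-result", type=str, required=True,
                        help="path for the SQA-format results JSON")
    parser.add_argument("--split", type=str, default="test",
                        help="data split to evaluate")
    # nargs="+" is this sketch's way to accept several letters on the CLI.
    parser.add_argument("--options", nargs="+", default=["A", "B", "C", "D", "E"],
                        help="valid option letters")
    return parser
```

argparse converts the dashes in flag names to underscores, so the values are read as args.base_dir, args.result_file, args.output_file, and args.output_result.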

Outputs

Name	Type	Description
output-file	JSON	Detailed analysis with correct and incorrect lists, each containing question_id, parsed_ans, ground_truth, question, pred, and is_multimodal
output-result	JSON	SQA-format results with acc, correct, count, per-question results (indices), and outputs (text)
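The shape of the two output files can be illustrated with the sketch below. The field names follow the table above; the helper names, the input tuple shape, the percentage scaling of acc, and the indent setting are assumptions of this sketch.

```python
import json


def build_outputs(entries):
    """Hypothetical helper: entries are (analysis_record, pred_idx, answer_idx) tuples."""
    analysis = {"correct": [], "incorrect": []}
    sqa = {"acc": None, "correct": None, "count": None, "results": {}, "outputs": {}}

    for record, pred_idx, answer_idx in entries:
        qid = record["question_id"]
        sqa["results"][qid] = pred_idx        # predicted choice index
        sqa["outputs"][qid] = record["pred"]  # raw prediction text
        bucket = "correct" if pred_idx == answer_idx else "incorrect"
        analysis[bucket].append(record)

    sqa["correct"] = len(analysis["correct"])
    sqa["count"] = len(entries)
    # Percentage scaling and the empty-input guard are assumptions.
    sqa["acc"] = 100.0 * sqa["correct"] / sqa["count"] if entries else 0.0
    return analysis, sqa


def write_outputs(analysis, sqa, output_file, output_result):
    """Write the detailed analysis and the SQA-format summary as JSON."""
    with open(output_file, "w") as f:
        json.dump(analysis, f, indent=2)
    with open(output_result, "w") as f:
        json.dump(sqa, f, indent=2)
```

Each analysis record is expected to hold the six documented keys: question_id, parsed_ans, ground_truth, question, pred, and is_multimodal.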

Usage Examples

Basic Usage

# Command-line execution for ScienceQA evaluation
# python internvl_chat_llava/llava/eval/eval_science_qa.py \
#     --base-dir /path/to/ScienceQA/data/scienceqa \
#     --result-file predictions.jsonl \
#     --output-file analysis.json \
#     --output-result sqa_results.json \
#     --split test
