Implementation:OpenGVLab InternVL ScienceQA Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Benchmark, Multiple_Choice |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This script evaluates model predictions on the ScienceQA multiple-choice benchmark by parsing answer options from model text and computing overall and image-specific accuracy.
Description
The eval_science_qa.py script implements the evaluation pipeline for the ScienceQA benchmark. It loads problem definitions from problems.json, split indices from pid_splits.json, and model predictions from a JSONL file, then keeps only the problems that belong to the target evaluation split (default: test).
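The loading step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper names load_split_problems and load_predictions are hypothetical, and the sketch assumes that pid_splits.json maps each split name to a list of problem IDs and that every JSONL record carries a question_id field.

```python
import json
import os


def load_split_problems(base_dir, split="test"):
    """Return {problem_id: problem} restricted to one split (hypothetical helper)."""
    with open(os.path.join(base_dir, "problems.json")) as f:
        problems = json.load(f)
    # Assumption: pid_splits.json looks like {"train": [...], "val": [...], "test": [...]}.
    with open(os.path.join(base_dir, "pid_splits.json")) as f:
        split_ids = json.load(f)[split]
    return {pid: problems[pid] for pid in split_ids}


def load_predictions(result_file):
    """Read a JSONL file into {question_id: record} (hypothetical helper)."""
    predictions = {}
    with open(result_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            # Assumption: each record has a "question_id" key.
            predictions[record["question_id"]] = record
    return predictions
```

Keying both dictionaries by problem ID makes it straightforward to detect problems in the split that have no prediction at all.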
Answer extraction uses a multi-strategy parsing approach:
- Direct option letter match (e.g., "A")
- Option letter followed by a period and a space (e.g., "A. ")
- Regex extraction of "The answer is X." pattern
- Fallback to "FAILED" if no pattern matches
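The four strategies listed above can be sketched as a single function. This is an illustration under stated assumptions rather than the script's exact code: parse_answer is a hypothetical name, the regex escapes the trailing period, and the sketch treats a prediction with more than one "The answer is X." match as a failure. The body of get_pred_idx is likewise an assumption built around the signature listed in this page; here it returns -1 when the parsed answer is not a valid option for the question.

```python
import re

OPTIONS = ["A", "B", "C", "D", "E"]
ANSWER_PATTERN = re.compile(r"The answer is ([A-Z])\.")


def parse_answer(pred_text, options=OPTIONS):
    """Map raw model text to an option letter, or "FAILED" (hypothetical helper)."""
    # Strategy 1: the whole prediction is a bare option letter.
    if pred_text in options:
        return pred_text
    # Strategy 2: an option letter followed by ". ", e.g. "C. Because ...".
    if len(pred_text) >= 3 and pred_text[0] in options and pred_text[1:3] == ". ":
        return pred_text[0]
    # Strategy 3: exactly one "The answer is X." sentence in the text.
    matches = ANSWER_PATTERN.findall(pred_text)
    if len(matches) == 1:
        return matches[0]
    # Strategy 4: nothing matched.
    return "FAILED"


def get_pred_idx(prediction, choices, options):
    """Convert a letter to a choice index; -1 if it is not valid for this question."""
    if prediction in options[:len(choices)]:
        return options.index(prediction)
    return -1
```

For instance, parse_answer("Plants need light. The answer is D.") yields "D", and get_pred_idx("E", ["x", "y"], OPTIONS) yields -1 because a two-choice question has no option E.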
The script computes:
- Overall accuracy across all questions
- Image-specific accuracy (IMG-Accuracy) for questions containing the <image> token, isolating multimodal reasoning performance
- Per-question analysis with parsed answer, ground truth, prediction text, and multimodal flag
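The two accuracy figures can be sketched as below. The function names and the record layout (a boolean is_correct plus the prompt under question) are assumptions made for this illustration; the sketch also returns 0.0 for an empty subset, which may differ from how the script behaves.

```python
def is_multimodal(prompt):
    """Flag a question as multimodal when its prompt carries the <image> token."""
    return "<image>" in prompt


def compute_accuracy(records):
    """Compute overall and image-only accuracy, both as percentages.

    records: list of dicts with a boolean "is_correct" and the prompt
    under "question" (a hypothetical layout for this sketch).
    """
    total = len(records)
    correct = sum(1 for r in records if r["is_correct"])
    img_records = [r for r in records if is_multimodal(r["question"])]
    img_correct = sum(1 for r in img_records if r["is_correct"])

    def pct(num, den):
        # Guard against an empty subset instead of dividing by zero.
        return 100.0 * num / den if den else 0.0

    return {
        "total": total,
        "correct": correct,
        "accuracy": pct(correct, total),
        "img_total": len(img_records),
        "img_accuracy": pct(img_correct, len(img_records)),
    }
```

Restricting the denominator to image questions is what separates IMG-Accuracy from the overall score: text-only questions do not affect it at all.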
Results are saved to two JSON files: a detailed analysis with correct/incorrect breakdowns and a summary with SQA-format results compatible with the official leaderboard.
Usage
Use this script to evaluate model outputs on ScienceQA after generating predictions with the model_vqa_science.py inference script. It handles the several answer formats that LLaVA models commonly produce.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/eval/eval_science_qa.py
- Lines: 1-114
Signature
def get_args() -> argparse.Namespace: ...
def convert_caps(results: list) -> list: ...
def get_pred_idx(prediction: str, choices: list, options: list) -> int: ...
Import
# This is a standalone CLI script, not typically imported
# Run via: python eval_science_qa.py --base-dir /path/to/sqa --result-file predictions.jsonl --output-file analysis.json --output-result sqa_results.json
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --base-dir | str (dir path) | Yes | ScienceQA base directory containing problems.json and pid_splits.json |
| --result-file | str (file path) | Yes | Path to JSONL file with model predictions |
| --output-file | str (file path) | Yes | Path for detailed JSON analysis output (correct/incorrect breakdowns) |
| --output-result | str (file path) | Yes | Path for SQA-format JSON results (acc, count, per-question results) |
| --split | str | No | Data split to evaluate (default: "test") |
| --options | list | No | Valid option letters (default: ["A", "B", "C", "D", "E"]) |
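An argument parser matching the table above might look like the following sketch. It is an assumption-laden illustration, not the script's get_args: build_parser is a hypothetical name, the four path arguments are marked required=True only to mirror the table, and --options is declared with nargs="+" as one reasonable way to accept a list of letters.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented inputs (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Evaluate ScienceQA predictions")
    # Required paths, per the Inputs table.
    parser.add_argument("--base-dir", type=str, required=True)
    parser.add_argument("--result-file", type=str, required=True)
    parser.add_argument("--output-file", type=str, required=True)
    parser.add_argument("--output-result", type=str, required=True)
    # Optional settings with the documented defaults.
    parser.add_argument("--split", type=str, default="test")
    # Assumption: option letters are passed space-separated, e.g. --options A B C D.
    parser.add_argument("--options", nargs="+", default=["A", "B", "C", "D", "E"])
    return parser
```

argparse converts dashes in flag names to underscores, so the parsed values are read as args.base_dir, args.result_file, and so on.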
Outputs
| Name | Type | Description |
|---|---|---|
| output-file | JSON | Detailed analysis with correct and incorrect lists, each containing question_id, parsed_ans, ground_truth, question, pred, and is_multimodal |
| output-result | JSON | SQA-format results with acc, correct, count, per-question results (indices), and outputs (text) |
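The shape of the two output files can be sketched as below. The field names come from the Outputs table; the helper names, the way the records are assembled, and the choice to express acc as a percentage are assumptions made for this illustration.

```python
import json

OPTIONS = ["A", "B", "C", "D", "E"]


def analysis_record(question_id, parsed_ans, answer_idx, prompt, pred_text, options=OPTIONS):
    """One entry of the detailed analysis (fields from the Outputs table)."""
    return {
        "question_id": question_id,
        "parsed_ans": parsed_ans,
        "ground_truth": options[answer_idx],
        "question": prompt,
        "pred": pred_text,
        "is_multimodal": "<image>" in prompt,
    }


def empty_outputs():
    """Create empty containers for the analysis file and the SQA-format file."""
    analysis = {"correct": [], "incorrect": []}
    sqa_results = {"acc": None, "correct": None, "count": None, "results": {}, "outputs": {}}
    return analysis, sqa_results


def record_prediction(analysis, sqa_results, record, pred_idx, answer_idx):
    """File one record under correct/incorrect and store its index and raw text."""
    bucket = "correct" if pred_idx == answer_idx else "incorrect"
    analysis[bucket].append(record)
    sqa_results["results"][record["question_id"]] = pred_idx
    sqa_results["outputs"][record["question_id"]] = record["pred"]


def finalize(analysis, sqa_results):
    """Fill in the summary counters (assumption: acc is a percentage)."""
    correct = len(analysis["correct"])
    count = correct + len(analysis["incorrect"])
    sqa_results["correct"] = correct
    sqa_results["count"] = count
    sqa_results["acc"] = 100.0 * correct / count if count else 0.0
```

Both containers hold only plain dictionaries, lists, strings, and numbers, so each can be written directly with json.dump.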
Usage Examples
Basic Usage
# Command-line execution for ScienceQA evaluation
# python internvl_chat_llava/llava/eval/eval_science_qa.py \
# --base-dir /path/to/ScienceQA/data/scienceqa \
# --result-file predictions.jsonl \
# --output-file analysis.json \
# --output-result sqa_results.json \
# --split test