
Implementation:OpenGVLab InternVL ScienceQA Evaluation

From Leeroopedia


Knowledge Sources
Domains: Evaluation, Benchmark, Multiple_Choice
Last Updated: 2026-02-07 14:00 GMT

Overview

This script evaluates model predictions on the ScienceQA multiple-choice benchmark by parsing answer options from model text and computing overall and image-specific accuracy.

Description

The eval_science_qa.py script implements the evaluation pipeline for the ScienceQA benchmark. It loads problem definitions from a JSON file, prediction results from a JSONL file, and split indices to filter for the target evaluation split (default: test).
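The loading step described above can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper name `load_eval_inputs` is hypothetical, and the sketch assumes that `pid_splits.json` maps a split name to a list of problem ids and that each JSONL line carries `question_id` and `text` fields.

```python
import json
from pathlib import Path


def load_eval_inputs(base_dir, result_file, split="test"):
    """Load problems for one split plus predictions (hypothetical helper)."""
    base = Path(base_dir)
    problems = json.loads((base / "problems.json").read_text())

    # Assumption: pid_splits.json looks like {"train": [...], "test": [...]}.
    split_ids = set(json.loads((base / "pid_splits.json").read_text())[split])

    # Assumption: each JSONL line is an object with "question_id" and "text".
    predictions = {}
    with open(result_file) as f:
        for line in f:
            if line.strip():
                record = json.loads(line)
                predictions[str(record["question_id"])] = record

    # Keep only the problems that belong to the requested split.
    split_problems = {pid: p for pid, p in problems.items() if pid in split_ids}
    return split_problems, predictions
```

Keying predictions by question id makes the later per-question lookup a dictionary access rather than a scan of the JSONL records.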

Answer extraction uses a multi-strategy parsing approach:

  1. Direct option letter match (e.g., "A")
  2. Option letter followed by a period and a space at the start of the prediction (e.g., "A. ")
  3. Regex extraction of "The answer is X." pattern
  4. Fallback to "FAILED" if no pattern matches
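The four strategies above can be sketched as a single function. The name `parse_answer` is hypothetical, and taking the first regex match is a choice made for this sketch; the script itself may handle multiple matches differently.

```python
import re

# Matches phrases such as "The answer is B." and captures the letter.
ANSWER_PATTERN = re.compile(r"The answer is ([A-Z])\.")


def parse_answer(pred_text, options=("A", "B", "C", "D", "E")):
    """Return the parsed option letter, or "FAILED" (illustrative sketch)."""
    # 1. The whole prediction is a bare option letter.
    if pred_text in options:
        return pred_text
    # 2. The prediction starts with an option letter followed by ". ".
    if len(pred_text) >= 3 and pred_text[0] in options and pred_text[1:3] == ". ":
        return pred_text[0]
    # 3. The prediction contains "The answer is X." somewhere in the text.
    match = ANSWER_PATTERN.search(pred_text)
    if match:
        return match.group(1)
    # 4. Nothing matched.
    return "FAILED"
```

The strategies run from strictest to loosest, so a clean single-letter answer never reaches the regex.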

The script computes:

  • Overall accuracy across all questions
  • Image-specific accuracy (IMG-Accuracy) for questions containing the <image> token, isolating multimodal reasoning performance
  • Per-question analysis with parsed answer, ground truth, prediction text, and multimodal flag
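A minimal sketch of the two accuracy figures, assuming each per-question record carries a boolean `correct` flag and an `is_multimodal` flag. The function names, the record shape, and reporting accuracy as a percentage are assumptions for illustration.

```python
def is_multimodal(prompt_text):
    """A question counts as multimodal when it contains the <image> token."""
    return "<image>" in prompt_text


def compute_accuracy(records):
    """Compute overall and image-only accuracy as percentages (sketch)."""
    total = len(records)
    correct = sum(1 for r in records if r["correct"])

    # Restrict to multimodal questions for IMG-Accuracy.
    image_records = [r for r in records if r["is_multimodal"]]
    image_correct = sum(1 for r in image_records if r["correct"])

    return {
        "accuracy": 100.0 * correct / total if total else 0.0,
        "img_accuracy": (
            100.0 * image_correct / len(image_records) if image_records else 0.0
        ),
    }
```

Guarding both divisions keeps the function safe when a split has no questions or no image questions.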

Results are saved to two JSON files: a detailed analysis with correct/incorrect breakdowns and a summary with SQA-format results compatible with the official leaderboard.

Usage

Use this script to evaluate model outputs on ScienceQA after generating predictions with the model_vqa_science.py inference script. It handles the several answer formats commonly produced by LLaVA models.

Code Reference

Source Location

Signature

def get_args() -> argparse.Namespace: ...

def convert_caps(results: list) -> list: ...

def get_pred_idx(prediction: str, choices: list, options: list) -> int: ...
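One plausible body for the `get_pred_idx` signature is sketched below. It is named `get_pred_idx_sketch` to make clear it is an illustration; in particular, returning -1 for an unparsable or out-of-range letter is an assumption, and the actual implementation may differ.

```python
def get_pred_idx_sketch(prediction, choices, options):
    """Map an option letter such as "C" to its index, e.g. 2 (sketch)."""
    # Only letters that correspond to an actual choice are valid:
    # a question with three choices accepts "A".."C" but not "D".
    valid_letters = options[:len(choices)]
    if prediction in valid_letters:
        return options.index(prediction)
    # Assumption: -1 marks answers that could not be mapped to a choice.
    return -1
```

For example, `get_pred_idx_sketch("C", ["x", "y", "z"], ["A", "B", "C", "D", "E"])` yields 2, while a "FAILED" parse yields -1 and can never equal a ground-truth index.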

Import

# This is a standalone CLI script, not typically imported
# Run via: python eval_science_qa.py --base-dir /path/to/sqa --result-file predictions.jsonl --output-file analysis.json --output-result sqa_results.json

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| --base-dir | str (dir path) | Yes | ScienceQA base directory containing problems.json and pid_splits.json |
| --result-file | str (file path) | Yes | Path to JSONL file with model predictions |
| --output-file | str (file path) | Yes | Path for detailed JSON analysis output (correct/incorrect breakdowns) |
| --output-result | str (file path) | Yes | Path for SQA-format JSON results (acc, count, per-question results) |
| --split | str | No | Data split to evaluate (default: "test") |
| --options | list | No | Valid option letters (default: ["A", "B", "C", "D", "E"]) |
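The arguments listed above could be declared with `argparse` roughly as follows. This is a sketch rather than the script's own `get_args`: the function name `build_parser` is hypothetical, `required=True` mirrors the Required column, and parsing `--options` with `nargs="+"` is an assumption about how the list is supplied.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented inputs (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Evaluate ScienceQA predictions")
    parser.add_argument("--base-dir", type=str, required=True)
    parser.add_argument("--result-file", type=str, required=True)
    parser.add_argument("--output-file", type=str, required=True)
    parser.add_argument("--output-result", type=str, required=True)
    parser.add_argument("--split", type=str, default="test")
    # Assumption: option letters are passed space-separated, e.g. --options A B C D.
    parser.add_argument("--options", nargs="+", default=["A", "B", "C", "D", "E"])
    return parser
```

`argparse` converts dashes in flag names to underscores, so the values are read as `args.base_dir`, `args.result_file`, and so on.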

Outputs

| Name | Type | Description |
|---|---|---|
| output-file | JSON | Detailed analysis with correct and incorrect lists, each containing question_id, parsed_ans, ground_truth, question, pred, and is_multimodal |
| output-result | JSON | SQA-format results with acc, correct, count, per-question results (indices), and outputs (text) |
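To make the two output shapes concrete, the sketch below assembles both structures from per-question entries. The field names come from the table above; the helper names, keying `results` and `outputs` by question id, storing -1 for unparsable answers, and expressing `acc` as a percentage are assumptions.

```python
import json

OPTIONS = ["A", "B", "C", "D", "E"]


def build_outputs(entries, options=OPTIONS):
    """Split per-question entries into the two output structures (sketch)."""
    analysis = {"correct": [], "incorrect": []}
    sqa = {"acc": 0.0, "correct": 0, "count": 0, "results": {}, "outputs": {}}
    for entry in entries:
        is_correct = entry["parsed_ans"] == entry["ground_truth"]
        analysis["correct" if is_correct else "incorrect"].append(entry)

        qid = entry["question_id"]
        # Assumption: results hold the chosen option index, -1 if unparsable.
        parsed = entry["parsed_ans"]
        sqa["results"][qid] = options.index(parsed) if parsed in options else -1
        sqa["outputs"][qid] = entry["pred"]
        sqa["correct"] += int(is_correct)
        sqa["count"] += 1
    if sqa["count"]:
        sqa["acc"] = 100.0 * sqa["correct"] / sqa["count"]
    return analysis, sqa


def write_json(path, payload):
    """Write one of the two structures to disk."""
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
```

Calling `write_json` once with each returned structure would produce the files named by `--output-file` and `--output-result`.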

Usage Examples

Basic Usage

# Command-line execution for ScienceQA evaluation
python internvl_chat_llava/llava/eval/eval_science_qa.py \
    --base-dir /path/to/ScienceQA/data/scienceqa \
    --result-file predictions.jsonl \
    --output-file analysis.json \
    --output-result sqa_results.json \
    --split test
