Implementation:Haotian liu LLaVA Eval Metrics Suite

From Leeroopedia

Overview

A collection of evaluation scripts that compute accuracy on the POPE, TextVQA, and ScienceQA benchmarks. Each script loads model answers from a JSONL output file and ground truth from benchmark-specific annotation files.

Description

Three evaluation scripts form the local metric computation suite:

  • eval_pope.py - Computes binary classification metrics (accuracy, precision, recall, F1, yes-ratio) for hallucination detection. Normalizes each model output to yes or no by checking for the negation words 'No', 'not', and 'no'. Evaluates each POPE category (adversarial, popular, random) separately by filtering answers on the category of their question.
  • eval_textvqa.py - Uses TextVQAAccuracyEvaluator from m4c_evaluator.py to compute VQA accuracy, with answer normalization and voting over the 10 human reference answers. Includes a prompt_processor() function that extracts the core question text from several OCR-augmented prompt formats.
  • eval_science_qa.py - Computes exact-match accuracy on multiple-choice ScienceQA questions, with per-category breakdowns. Parses the model's text output with several heuristic patterns to extract the option letter. Writes both a detailed analysis JSON and a summary results JSON.

Sources

  • llava/eval/eval_pope.py:L5-81
  • llava/eval/eval_textvqa.py:L35-51 (eval_single), L17-32 (prompt_processor)
  • llava/eval/eval_science_qa.py:L39-114
  • llava/eval/m4c_evaluator.py:L221-257 (TextVQAAccuracyEvaluator)

API Signatures

eval_pope.py

def eval_pope(answers: list, label_file: str) -> None:
    """
    Evaluate POPE binary classification metrics for a single category.

    Args:
        answers: List of dicts with a 'text' field containing model predictions.
                 Text is normalized: if it contains the word 'No', 'not', or 'no'
                 it becomes 'no', otherwise 'yes'.
        label_file: Path to the POPE ground-truth file in JSON Lines format
                    (one JSON object per line, each with a 'label' field).

    Prints:
        TP, FP, TN, FN counts
        Accuracy, Precision, Recall, F1 score, Yes ratio
        Summary line: "F1, accuracy, precision, recall, yes_ratio"
    """

CLI usage:

python llava/eval/eval_pope.py \
    --annotation-dir ./playground/data/eval/pope/coco \
    --question-file ./playground/data/eval/pope/llava_pope_test.jsonl \
    --result-file ./playground/data/eval/pope/answers/llava-v1.5-13b.jsonl

| Argument | Type | Description |
|---|---|---|
| --annotation-dir | str | Directory containing coco_pope_*.json label files |
| --question-file | str | Question JSONL with question_id and category fields |
| --result-file | str | Model answer JSONL from model_vqa_loader.py |

The main block iterates over all files matching coco_pope_*.json in the annotation directory, extracts the category name from the filename, filters answers by category, and calls eval_pope() for each.
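The normalization and per-category filtering described above can be sketched as follows. This is a minimal illustration, not the code in eval_pope.py: the helper names normalize_answer, category_from_filename, and filter_answers are hypothetical, the whole-word match and comma stripping are assumptions, and the in-memory lists stand in for the question and result JSONL files.

```python
from pathlib import Path


def normalize_answer(text: str) -> str:
    """Map free-form model text to 'yes' or 'no' (hypothetical helper)."""
    words = text.replace(",", "").split()
    # Assumption: any whole word equal to 'No', 'not', or 'no' marks a negative answer.
    return "no" if any(w in ("No", "not", "no") for w in words) else "yes"


def category_from_filename(path: str) -> str:
    """Extract '<category>' from 'coco_pope_<category>.json' (assumed naming scheme)."""
    return Path(path).stem[len("coco_pope_"):]


def filter_answers(answers: list, questions: dict, category: str) -> list:
    """Keep only the answers whose question belongs to the given category."""
    return [a for a in answers if questions[a["question_id"]]["category"] == category]


# Tiny in-memory stand-ins for the question and result JSONL files.
questions = {
    1: {"question_id": 1, "category": "adversarial"},
    2: {"question_id": 2, "category": "popular"},
    3: {"question_id": 3, "category": "adversarial"},
}
answers = [
    {"question_id": 1, "text": "Yes, there is a dog."},
    {"question_id": 2, "text": "No, there is no cat."},
    {"question_id": 3, "text": "There is not a bench"},
]

for label_file in ["coco_pope_adversarial.json", "coco_pope_popular.json"]:
    category = category_from_filename(label_file)
    subset = filter_answers(answers, questions, category)
    normalized = [normalize_answer(a["text"]) for a in subset]
    print(f"Category: {category}, # samples: {len(subset)}, answers: {normalized}")
```

In the real script the filtered answers would then be passed to eval_pope() together with the matching label file.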

eval_textvqa.py

def eval_single(annotation_file: str, result_file: str) -> None:
    """
    Evaluate TextVQA accuracy for a single result file.

    Args:
        annotation_file: Path to TextVQA annotation JSON (with 'data' key containing
                        list of annotations with 'image_id', 'question', 'answers')
        result_file: Path to model answer JSONL with 'question_id', 'prompt', 'text'

    Prints:
        Experiment name, sample count, accuracy percentage
    """

def prompt_processor(prompt: str) -> str:
    """
    Extract core question text from OCR-augmented prompts.
    Handles three prompt formats:
    1. 'OCR tokens: ... Question: <q> Short answer:' format
    2. 'Reference OCR token: ...' multi-line format
    3. Simple two-line format (question + instruction)
    Returns lowercased question text.
    """

CLI usage:

python -m llava.eval.eval_textvqa \
    --annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
    --result-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl

| Argument | Type | Description |
|---|---|---|
| --annotation-file | str | TextVQA annotation JSON with ground truth answers |
| --result-file | str | Single model answer JSONL file |
| --result-dir | str | (Alternative) Directory of JSONL files to evaluate all at once |
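As an illustration of the three prompt formats that prompt_processor() handles, the sketch below shows one way such a dispatcher could be written. The function name extract_question is hypothetical, and the exact branch conditions (prefix checks and line counts) are assumptions inferred from the docstring, so treat it as a rough model rather than the script's implementation.

```python
import re


def extract_question(prompt: str) -> str:
    """Return the lowercased question from one of three assumed prompt layouts."""
    lines = prompt.split("\n")
    if prompt.startswith("OCR tokens: "):
        # Format 1: the question sits between 'Question: ' and ' Short answer:'.
        match = re.search(r"Question: (.*?) Short answer:", prompt, re.DOTALL)
        if match is None:
            raise ValueError("no 'Question: ... Short answer:' span found")
        question = match.group(1)
    elif "Reference OCR token: " in prompt and len(lines) == 3:
        # Format 2: three lines; the OCR line may come before or after the question.
        question = lines[1] if prompt.startswith("Reference OCR token:") else lines[0]
    elif len(lines) == 2:
        # Format 3: question on the first line, answer instruction on the second.
        question = lines[0]
    else:
        raise ValueError("unrecognized prompt format")
    return question.lower()


print(extract_question("What Year is shown?\nAnswer the question using a single word or phrase."))
# -> what year is shown?
```

Lowercasing the extracted question lets it serve as a stable lookup key alongside the image identifier when matching predictions to annotations.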

eval_science_qa.py

# Main block (no function wrapper)
# CLI: python llava/eval/eval_science_qa.py --base-dir ... --result-file ... --output-file ... --output-result ...

def get_pred_idx(prediction: str, choices: list, options: list) -> int:
    """
    Convert predicted option letter to index.

    Args:
        prediction: Predicted letter (e.g., 'C')
        choices: List of answer choice texts
        options: List of valid option letters ['A', 'B', 'C', 'D', 'E']

    Returns:
        Index of the predicted option, or -1 if invalid
    """

CLI usage:

python llava/eval/eval_science_qa.py \
    --base-dir ./playground/data/eval/scienceqa \
    --result-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b.jsonl \
    --output-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_output.json \
    --output-result ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_result.json \
    --split test

| Argument | Type | Default | Description |
|---|---|---|---|
| --base-dir | str | (required) | Directory containing pid_splits.json and problems.json |
| --result-file | str | (required) | Model answer JSONL file |
| --output-file | str | (required) | Path to write detailed analysis JSON (correct/incorrect lists) |
| --output-result | str | (required) | Path to write summary results JSON (acc, correct, count) |
| --split | str | test | Dataset split to evaluate (matches keys in pid_splits.json) |
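The letter-to-index conversion documented for get_pred_idx() can be sketched as below, together with a simplified letter extractor. The get_pred_idx body follows the docstring (index of a valid letter, otherwise -1); extract_letter is a hypothetical helper whose three patterns are illustrative assumptions, not the heuristics actually used in eval_science_qa.py.

```python
import re

OPTIONS = ["A", "B", "C", "D", "E"]


def get_pred_idx(prediction: str, choices: list, options: list) -> int:
    """Return the index of a valid option letter, or -1 if it is out of range."""
    if prediction in options[:len(choices)]:
        return options.index(prediction)
    return -1


def extract_letter(text: str) -> str:
    """Hypothetical heuristic: pull an option letter out of raw model text."""
    text = text.strip()
    if text in OPTIONS:
        return text  # bare letter, e.g. 'C'
    if len(text) >= 2 and text[0] in OPTIONS and text[1] == ".":
        return text[0]  # letter followed by the choice text, e.g. 'C. gas'
    match = re.search(r"The answer is ([A-E])\.", text)
    if match:
        return match.group(1)  # full sentence, e.g. 'The answer is C.'
    return "FAILED"  # sentinel that never matches a valid option


choices = ["solid", "liquid", "gas"]
print(get_pred_idx(extract_letter("The answer is B."), choices, OPTIONS))  # 1
print(get_pred_idx(extract_letter("D"), choices, OPTIONS))  # -1, only A-C are valid here
```

Returning -1 for unparseable or out-of-range predictions means such answers are simply counted as incorrect, since no ground-truth index is negative.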

Inputs

  • Model answer JSONL - Output from model_vqa_loader.py with question_id, prompt, text fields
  • Ground-truth annotation files - Benchmark-specific formats:
    • POPE: JSONL with label field (yes/no)
    • TextVQA: JSON with data array containing image_id, question, answers (list of 10)
    • ScienceQA: problems.json (question data) + pid_splits.json (train/val/test splits)
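For the ScienceQA inputs, selecting the problems of one split might look like the sketch below. The helper load_split is hypothetical, and the file layouts are assumptions: pid_splits.json is taken to map a split name to a list of problem IDs, and problems.json to map each problem ID to its record. A temporary directory with toy data keeps the example runnable without the real dataset.

```python
import json
import os
import tempfile


def load_split(base_dir: str, split: str) -> dict:
    """Return {problem_id: problem} for the requested split (assumed file layout)."""
    with open(os.path.join(base_dir, "pid_splits.json"), encoding="utf-8") as f:
        split_ids = json.load(f)[split]
    with open(os.path.join(base_dir, "problems.json"), encoding="utf-8") as f:
        problems = json.load(f)
    return {pid: problems[pid] for pid in split_ids}


# Build a miniature base directory so the sketch runs end to end.
with tempfile.TemporaryDirectory() as base_dir:
    with open(os.path.join(base_dir, "pid_splits.json"), "w", encoding="utf-8") as f:
        json.dump({"train": ["1"], "test": ["2", "3"]}, f)
    with open(os.path.join(base_dir, "problems.json"), "w", encoding="utf-8") as f:
        json.dump(
            {
                "1": {"choices": ["a", "b"], "answer": 0},
                "2": {"choices": ["a", "b", "c"], "answer": 2},
                "3": {"choices": ["a", "b"], "answer": 1},
            },
            f,
        )
    test_problems = load_split(base_dir, "test")

print(sorted(test_problems))
```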

Outputs

POPE

Printed to stdout per category:

Category: adversarial, # samples: 3000
TP      FP      TN      FN
1423    77      1423    77
Accuracy: 0.9487
Precision: 0.9487
Recall: 0.9487
F1 score: 0.9487
Yes ratio: 0.5
====================================
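The figures in this sample follow directly from the four confusion counts, with 'yes' as the positive class. The sketch below recomputes them using the standard definitions; pope_metrics is a hypothetical helper name, not a function in eval_pope.py.

```python
def pope_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive binary classification metrics from confusion counts.

    Assumes non-zero denominators; a production version would guard against
    empty categories or a model that never answers 'yes'.
    """
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "yes_ratio": (tp + fp) / total,
    }


# Counts taken from the sample output above.
metrics = pope_metrics(tp=1423, fp=77, tn=1423, fn=77)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The yes ratio is the share of predictions normalized to yes. A value far from 0.5 on a balanced label set points to a bias toward one answer.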

TextVQA

llava-v1.5-13b
Samples: 5000
Accuracy: 58.21%
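The voting over 10 reference answers is commonly defined as the VQA soft accuracy: a prediction scores min(1, matches / 3) against each leave-one-out subset of the human answers, and the scores are averaged. The sketch below illustrates that formula only. The name soft_vqa_score is hypothetical, answer normalization is omitted, and it is not a copy of TextVQAAccuracyEvaluator.

```python
def soft_vqa_score(prediction: str, human_answers: list) -> float:
    """Average min(1, matches / 3) over all leave-one-out subsets of the answers."""
    scores = []
    for held_out in range(len(human_answers)):
        others = human_answers[:held_out] + human_answers[held_out + 1:]
        matches = sum(answer == prediction for answer in others)
        scores.append(min(1.0, matches / 3))
    return sum(scores) / len(scores)


# Ten reference answers: 'stop' appears 3 times, 'stop sign' 7 times.
refs = ["stop"] * 3 + ["stop sign"] * 7
print(round(soft_vqa_score("stop", refs), 2))  # 0.9
print(soft_vqa_score("stop sign", refs))       # 1.0
print(soft_vqa_score("yield", refs))           # 0.0
```

A prediction therefore earns full credit once at least three of the remaining annotators agree with it, and partial credit otherwise. The reported accuracy is the mean of these per-question scores.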

ScienceQA

Printed to stdout and written to JSON files:

Total: 4241, Correct: 2973, Accuracy: 70.10%, IMG-Accuracy: 68.45%

The output-result JSON contains:

{
    "acc": 70.10,
    "correct": 2973,
    "count": 4241,
    "results": {"problem_id": predicted_index, ...},
    "outputs": {"problem_id": "raw_model_text", ...}
}
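A summary with these keys could be assembled as in the sketch below. The helper summarize and the toy problem IDs are hypothetical, and the IMG-Accuracy figure is left out. The point is only to show how acc, correct, and count relate.

```python
import json


def summarize(results: dict, answers: dict, outputs: dict) -> dict:
    """Build a summary dict with the keys listed above (hypothetical helper).

    results: problem_id -> predicted option index (-1 when parsing failed)
    answers: problem_id -> ground-truth option index
    outputs: problem_id -> raw model text
    """
    correct = sum(results[pid] == answers[pid] for pid in results)
    count = len(results)
    return {
        "acc": correct / count * 100,
        "correct": correct,
        "count": count,
        "results": results,
        "outputs": outputs,
    }


summary = summarize(
    results={"11": 0, "12": 2, "13": -1, "14": 1},
    answers={"11": 0, "12": 2, "13": 1, "14": 1},
    outputs={"11": "A", "12": "The answer is C.", "13": "unsure", "14": "B"},
)
print(json.dumps(summary, indent=4))
```

With the figures shown above, 2973 / 4241 * 100 rounds to 70.10, matching the printed accuracy.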

Metadata

| Property | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| page_type | Implementation (API Doc) |
| workflow | Benchmark_Evaluation |
| source_files | llava/eval/eval_pope.py, llava/eval/eval_textvqa.py, llava/eval/eval_science_qa.py |
