Implementation:Haotian liu LLaVA Eval Metrics Suite

Overview

Collection of evaluation scripts that compute accuracy-style metrics for the POPE, TextVQA, and ScienceQA benchmarks. Each script loads model answers from a JSONL output file and ground truth from benchmark-specific annotation files.

Description

Three evaluation scripts form the local metric computation suite:

eval_pope.py - Computes binary classification metrics (accuracy, precision, recall, F1, yes-ratio) for hallucination detection. Normalizes model outputs to binary yes/no by checking for negation words. Evaluates per POPE category (adversarial, popular, random) by filtering answers based on question category.

eval_textvqa.py - Uses TextVQAAccuracyEvaluator from m4c_evaluator.py for VQA accuracy with answer normalization and 10-answer voting. Includes a prompt_processor() function that extracts the core question text from various OCR-augmented prompt formats.

eval_science_qa.py - Computes exact-match accuracy on multiple-choice ScienceQA questions with per-category breakdowns. Parses model text output to extract option letters using multiple heuristic patterns. Outputs both a detailed analysis JSON and a summary results JSON.

Sources

llava/eval/eval_pope.py:L5-81
llava/eval/eval_textvqa.py:L35-51 (eval_single), L17-32 (prompt_processor)
llava/eval/eval_science_qa.py:L39-114
llava/eval/m4c_evaluator.py:L221-257 (TextVQAAccuracyEvaluator)

API Signatures

eval_pope.py

def eval_pope(answers: list, label_file: str) -> None:
    """
    Evaluate POPE binary classification metrics for a single category.

    Args:
        answers: List of dicts with 'text' field containing model predictions.
                 Each 'text' is normalized to 'no' if it contains the word
                 'No', 'not', or 'no'; otherwise it becomes 'yes'.
        label_file: Path to POPE ground-truth JSON Lines file (one JSON object
                    per line, each with a 'label' field).

    Prints:
        TP, FP, TN, FN counts
        Accuracy, Precision, Recall, F1 score, Yes ratio
        Summary line: "F1, accuracy, precision, recall, yes_ratio"
    """

CLI usage:

python llava/eval/eval_pope.py \
    --annotation-dir ./playground/data/eval/pope/coco \
    --question-file ./playground/data/eval/pope/llava_pope_test.jsonl \
    --result-file ./playground/data/eval/pope/answers/llava-v1.5-13b.jsonl

Argument	Type	Description
`--annotation-dir`	str	Directory containing `coco_pope_*.json` label files
`--question-file`	str	Question JSONL with `question_id` and `category` fields
`--result-file`	str	Model answer JSONL from `model_vqa_loader.py`

The main block iterates over all files matching coco_pope_*.json in the annotation directory, extracts the category name from the filename, filters answers by category, and calls eval_pope() for each.
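The normalization and metric computation described for eval_pope() can be sketched as follows. This is a minimal illustration rather than the script's actual code: the helper names normalize_answer and binary_metrics are hypothetical, and inspecting only the first sentence with commas stripped is an assumption about how the word check is applied.

```python
def normalize_answer(text):
    """Map free-form model output to 'yes' or 'no' (hypothetical helper)."""
    # Assumption: only the first sentence is inspected, with commas stripped.
    first_sentence = text.split('.')[0].replace(',', '')
    words = first_sentence.split(' ')
    # Exact word matches only: 'No', 'not', 'no' (the words named in the docstring).
    if 'No' in words or 'not' in words or 'no' in words:
        return 'no'
    return 'yes'


def binary_metrics(preds, labels):
    """Compute POPE-style metrics, treating 'yes' as the positive class."""
    pairs = list(zip(preds, labels))
    if not pairs:
        raise ValueError('no predictions to score')
    tp = sum(1 for p, l in pairs if p == 'yes' and l == 'yes')
    fp = sum(1 for p, l in pairs if p == 'yes' and l == 'no')
    tn = sum(1 for p, l in pairs if p == 'no' and l == 'no')
    fn = sum(1 for p, l in pairs if p == 'no' and l == 'yes')
    # Guard against empty denominators when one class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        'TP': tp, 'FP': fp, 'TN': tn, 'FN': fn,
        'accuracy': (tp + tn) / len(pairs),
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'yes_ratio': (tp + fp) / len(pairs),
    }
```

Because the word check is an exact match, an answer with no matching word (for example a refusal or an unrelated sentence) is counted as 'yes', which is worth keeping in mind when reading the yes-ratio.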

eval_textvqa.py

def eval_single(annotation_file: str, result_file: str) -> None:
    """
    Evaluate TextVQA accuracy for a single result file.

    Args:
        annotation_file: Path to TextVQA annotation JSON (with 'data' key containing
                        list of annotations with 'image_id', 'question', 'answers')
        result_file: Path to model answer JSONL with 'question_id', 'prompt', 'text'

    Prints:
        Experiment name, sample count, accuracy percentage
    """

def prompt_processor(prompt: str) -> str:
    """
    Extract core question text from OCR-augmented prompts.
    Handles three prompt formats:
    1. 'OCR tokens: ... Question: <q> Short answer:' format
    2. 'Reference OCR token: ...' multi-line format
    3. Simple two-line format (question + instruction)
    Returns lowercased question text.
    """

CLI usage:

python -m llava.eval.eval_textvqa \
    --annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
    --result-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl

Argument	Type	Description
`--annotation-file`	str	TextVQA annotation JSON with ground truth answers
`--result-file`	str	Single model answer JSONL file
`--result-dir`	str	(Alternative) Directory of JSONL files to evaluate all at once
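The three prompt formats listed for prompt_processor() could be handled along the following lines. This is a sketch under stated assumptions, not the repository's implementation: the function name extract_question is hypothetical, and the exact markers and line positions used to tell the formats apart are inferred from the docstring above.

```python
import re


def extract_question(prompt):
    """Return the lowercased question from an OCR-augmented prompt (sketch)."""
    lines = prompt.split('\n')
    if prompt.startswith('OCR tokens: '):
        # Format 1: 'OCR tokens: ... Question: <q> Short answer:'
        match = re.search(r'Question: (.*?) Short answer:', prompt, re.DOTALL)
        if match is None:
            raise ValueError('no question found in OCR-token prompt')
        question = match.group(1)
    elif 'Reference OCR token: ' in prompt and len(lines) == 3:
        # Format 2: three lines; assume the question is whichever of the
        # first two lines is not the OCR reference line.
        if prompt.startswith('Reference OCR token:'):
            question = lines[1]
        else:
            question = lines[0]
    elif len(lines) == 2:
        # Format 3: question on the first line, instruction on the second.
        question = lines[0]
    else:
        raise ValueError('unrecognized prompt format')
    return question.lower()
```

Lowercasing the result matters because the extracted question is typically used as part of a lookup key against the annotation file, so both sides need the same casing.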

eval_science_qa.py

# Main block (no function wrapper)
# CLI: python llava/eval/eval_science_qa.py --base-dir ... --result-file ... --output-file ... --output-result ...

def get_pred_idx(prediction: str, choices: list, options: list) -> int:
    """
    Convert predicted option letter to index.

    Args:
        prediction: Predicted letter (e.g., 'C')
        choices: List of answer choice texts
        options: List of valid option letters ['A', 'B', 'C', 'D', 'E']

    Returns:
        Index of the predicted option, or -1 if invalid
    """

CLI usage:

python llava/eval/eval_science_qa.py \
    --base-dir ./playground/data/eval/scienceqa \
    --result-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b.jsonl \
    --output-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_output.json \
    --output-result ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_result.json \
    --split test

Argument	Type	Default	Description
`--base-dir`	str	(required)	Directory containing `pid_splits.json` and `problems.json`
`--result-file`	str	(required)	Model answer JSONL file
`--output-file`	str	(required)	Path to write detailed analysis JSON (correct/incorrect lists)
`--output-result`	str	(required)	Path to write summary results JSON (acc, correct, count)
`--split`	str	`test`	Dataset split to evaluate (matches keys in `pid_splits.json`)
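The letter extraction and index conversion steps can be illustrated with the sketch below. The function names extract_letter and letter_to_index are hypothetical, and the three patterns shown (bare letter, letter followed by a period and space, and a "The answer is X." phrase) are assumed examples of the "multiple heuristic patterns" mentioned in the description, not a confirmed list.

```python
import re

OPTIONS = ['A', 'B', 'C', 'D', 'E']


def extract_letter(pred_text, options=OPTIONS):
    """Pull an option letter out of raw model text, or return 'FAILED'."""
    pred_text = pred_text.strip()
    # Pattern 1: the output is just the letter.
    if pred_text in options:
        return pred_text
    # Pattern 2: the output starts with 'X. ' followed by explanation.
    if len(pred_text) >= 3 and pred_text[0] in options and pred_text[1:3] == '. ':
        return pred_text[0]
    # Pattern 3: exactly one 'The answer is X.' phrase in the text.
    found = re.findall(r'The answer is ([A-Z])\.', pred_text)
    if len(found) == 1:
        return found[0]
    return 'FAILED'


def letter_to_index(prediction, choices, options=OPTIONS):
    """Convert a letter to a choice index; -1 if it is not a valid choice."""
    # Only letters that correspond to an existing choice are accepted.
    if prediction in options[:len(choices)]:
        return options.index(prediction)
    return -1
```

Returning -1 instead of guessing keeps unparsable outputs visible in the results, since -1 can never equal a ground-truth answer index.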

Inputs

Model answer JSONL - Output from model_vqa_loader.py with question_id, prompt, text fields
Ground-truth annotation files - Benchmark-specific formats:
- POPE: JSONL with label field (yes/no)
- TextVQA: JSON with data array containing image_id, question, answers (list of 10)
- ScienceQA: problems.json (question data) + pid_splits.json (train/val/test splits)
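As a small illustration of the answer-file format, the sketch below reads a JSONL file into a list of dicts and indexes it by question_id. The helper names load_jsonl and index_by_question_id are hypothetical, and skipping blank lines is a convenience assumption rather than documented behavior of the scripts.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


def index_by_question_id(records):
    """Build a lookup from question_id to the full answer record."""
    return {record['question_id']: record for record in records}
```

An index like this makes it straightforward to join answers with question metadata, such as the category field used for POPE filtering.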

Outputs

POPE

Printed to stdout per category:

Category: adversarial, # samples: 3000
TP      FP      TN      FN
1423    77      1423    77
Accuracy: 0.9487
Precision: 0.9487
Recall: 0.9487
F1 score: 0.9487
Yes ratio: 0.5
====================================

TextVQA

llava-v1.5-13b
Samples: 5000
Accuracy: 58.21%
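The per-question score behind this accuracy can be sketched as a soft vote over the 10 human answers. The example assumes the leave-one-out formulation commonly used by M4C-style VQA evaluators (each of the 10 answers is held out in turn and the prediction scores min(1, matches / 3) against the remaining 9); the page itself only states "10-answer voting", so treat the exact formula as an assumption. Answer normalization is omitted, and soft_vqa_score is a hypothetical name.

```python
def soft_vqa_score(prediction, gt_answers):
    """Average leave-one-out VQA score of a prediction against 10 answers."""
    if len(gt_answers) != 10:
        raise ValueError('expected exactly 10 ground-truth answers')
    scores = []
    for held_out in range(len(gt_answers)):
        # Drop one annotator's answer and count matches among the other nine.
        others = gt_answers[:held_out] + gt_answers[held_out + 1:]
        matches = sum(1 for answer in others if answer == prediction)
        # Three or more agreeing annotators give full credit.
        scores.append(min(1.0, matches / 3))
    return sum(scores) / len(scores)
```

Under this formulation a prediction matched by only two of the ten annotators still earns partial credit, and the reported accuracy is the mean of these scores over all samples, expressed as a percentage.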

ScienceQA

Printed to stdout and written to JSON files:

Total: 4241, Correct: 2973, Accuracy: 70.10%, IMG-Accuracy: 68.45%

The output-result JSON contains:

{
    "acc": 70.10,
    "correct": 2973,
    "count": 4241,
    "results": {"problem_id": predicted_index, ...},
    "outputs": {"problem_id": "raw_model_text", ...}
}
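A summary file with this shape can be sanity-checked after a run. The sketch below is illustrative only: load_summary and check_summary are hypothetical helpers, the tolerance of 0.01 percentage points is an arbitrary choice, and treating a predicted index of -1 as an unparsable answer is an assumption based on the get_pred_idx() description.

```python
import json


def load_summary(path):
    """Load the summary JSON written via --output-result."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


def check_summary(summary, tol=0.01):
    """Verify acc against correct/count and count unparsable predictions."""
    expected_acc = summary['correct'] / summary['count'] * 100
    # Assumption: -1 marks a prediction that could not be mapped to a choice.
    unparsed = sum(1 for idx in summary['results'].values() if idx == -1)
    return {
        'acc_consistent': abs(summary['acc'] - expected_acc) <= tol,
        'unparsed': unparsed,
    }
```

A high unparsed count suggests that the model's output format does not match the extraction heuristics, which would depress accuracy for reasons unrelated to answer quality.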

Related Pages

implements Principle:Haotian_liu_LLaVA_Benchmark_Metric_Computation

Metadata

Property	Value
last_updated	2026-02-13 14:00 GMT
page_type	Implementation (API Doc)
workflow	Benchmark_Evaluation
source_files	llava/eval/eval_pope.py, llava/eval/eval_textvqa.py, llava/eval/eval_science_qa.py

Environment:Haotian_liu_LLaVA_OpenAI_API_Evaluation_Environment
