Implementation:Haotian liu LLaVA Eval Metrics Suite

From Leeroopedia
Revision as of 12:55, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Haotian_liu_LLaVA_Eval_Metrics_Suite.md)

Overview

Collection of evaluation metric scripts for computing accuracy across POPE, TextVQA, and ScienceQA benchmarks. Each script loads model answers from JSONL output files and ground truth from benchmark-specific annotation files.

Description

Three evaluation scripts form the local metric computation suite:

  • eval_pope.py - Computes binary classification metrics (accuracy, precision, recall, F1, yes-ratio) for hallucination detection. Normalizes model outputs to binary yes/no by checking for negation words. Evaluates per POPE category (adversarial, popular, random) by filtering answers based on question category.
  • eval_textvqa.py - Uses TextVQAAccuracyEvaluator from m4c_evaluator.py for VQA accuracy with answer normalization and 10-answer voting. Includes a prompt_processor() function that extracts the core question text from various OCR-augmented prompt formats.
  • eval_science_qa.py - Computes exact-match accuracy on multiple-choice ScienceQA questions with per-category breakdowns. Parses model text output to extract option letters using multiple heuristic patterns. Outputs both a detailed analysis JSON and a summary results JSON.

Sources

  • llava/eval/eval_pope.py:L5-81
  • llava/eval/eval_textvqa.py:L35-51 (eval_single), L17-32 (prompt_processor)
  • llava/eval/eval_science_qa.py:L39-114
  • llava/eval/m4c_evaluator.py:L221-257 (TextVQAAccuracyEvaluator)

API Signatures

eval_pope.py

def eval_pope(answers: list, label_file: str) -> None:
    """
    Evaluate POPE binary classification metrics for a single category.

    Args:
        answers: List of dicts with 'text' field containing model predictions.
                 Text is normalized: if it contains the word 'No', 'not', or 'no' -> 'no', else -> 'yes'.
        label_file: Path to POPE ground-truth file (JSON Lines, despite the .json extension) with a 'label' field on each line.

    Prints:
        TP, FP, TN, FN counts
        Accuracy, Precision, Recall, F1 score, Yes ratio
        Summary line: "F1, accuracy, precision, recall, yes_ratio"
    """
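The normalization and metric computation described in the docstring can be sketched as follows. This is a minimal illustration, not the script itself: the names normalize_answer and binary_metrics are hypothetical, inspecting only the first sentence and dropping commas is an assumption, and the zero-division guards are an added safety choice.

```python
def normalize_answer(text: str) -> str:
    """Map free-form model text to 'yes' or 'no' by looking for negation words."""
    # Assumption: only the first sentence is inspected, and commas are dropped
    # so that "No," is matched as the word "No".
    first_sentence = text.split(".")[0].replace(",", "")
    words = first_sentence.split(" ")
    if "No" in words or "not" in words or "no" in words:
        return "no"
    return "yes"


def binary_metrics(preds: list, labels: list) -> dict:
    """Confusion counts and derived metrics, with 'yes' as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, l in pairs if p == "yes" and l == "yes")
    fp = sum(1 for p, l in pairs if p == "yes" and l == "no")
    tn = sum(1 for p, l in pairs if p == "no" and l == "no")
    fn = sum(1 for p, l in pairs if p == "no" and l == "yes")
    # Guards against empty denominators are an addition for this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + tn + fn
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": preds.count("yes") / len(preds) if preds else 0.0,
    }
```

Because matching is done on whole words, a reply such as "Nothing" is not treated as a negation in this sketch.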

CLI usage:

python llava/eval/eval_pope.py \
    --annotation-dir ./playground/data/eval/pope/coco \
    --question-file ./playground/data/eval/pope/llava_pope_test.jsonl \
    --result-file ./playground/data/eval/pope/answers/llava-v1.5-13b.jsonl

Argument          Type  Description
--annotation-dir  str   Directory containing coco_pope_*.json label files
--question-file   str   Question JSONL with question_id and category fields
--result-file     str   Model answer JSONL from model_vqa_loader.py

The main block iterates over all files matching coco_pope_*.json in the annotation directory, extracts the category name from the filename, filters answers by category, and calls eval_pope() for each.
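The per-category flow of the main block can be sketched roughly as below. The helper names category_from_filename and answers_for_category are hypothetical, and the sketch works on in-memory records rather than reading files.

```python
import os


def category_from_filename(path: str) -> str:
    """Extract 'adversarial' from a path ending in 'coco_pope_adversarial.json'."""
    name = os.path.basename(path)
    prefix, suffix = "coco_pope_", ".json"
    if not (name.startswith(prefix) and name.endswith(suffix)):
        raise ValueError(f"unexpected label file name: {name}")
    return name[len(prefix):-len(suffix)]


def answers_for_category(questions: list, answers: list, category: str) -> list:
    """Keep only the answers whose question belongs to the given category."""
    wanted_ids = {q["question_id"] for q in questions if q["category"] == category}
    return [a for a in answers if a["question_id"] in wanted_ids]
```

Each filtered list would then be passed to eval_pope() together with the matching label file.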

eval_textvqa.py

def eval_single(annotation_file: str, result_file: str) -> None:
    """
    Evaluate TextVQA accuracy for a single result file.

    Args:
        annotation_file: Path to TextVQA annotation JSON (with 'data' key containing
                        list of annotations with 'image_id', 'question', 'answers')
        result_file: Path to model answer JSONL with 'question_id', 'prompt', 'text'

    Prints:
        Experiment name, sample count, accuracy percentage
    """

def prompt_processor(prompt: str) -> str:
    """
    Extract core question text from OCR-augmented prompts.
    Handles three prompt formats:
    1. 'OCR tokens: ... Question: <q> Short answer:' format
    2. 'Reference OCR token: ...' multi-line format
    3. Simple two-line format (question + instruction)
    Returns lowercased question text.
    """
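A minimal sketch of how the three formats might be distinguished is shown here. The function name extract_question is hypothetical, and the exact prompt layouts (in particular, using the line count to tell formats 2 and 3 apart) are assumptions made for illustration.

```python
import re


def extract_question(prompt: str) -> str:
    """Return the lowercased question text from an OCR-augmented prompt."""
    lines = prompt.split("\n")
    if prompt.startswith("OCR tokens: "):
        # Format 1: the question sits between 'Question:' and 'Short answer:'.
        match = re.search(r"Question: (.*?) Short answer:", prompt, re.DOTALL)
        if match is None:
            raise ValueError("no 'Question: ... Short answer:' span found")
        question = match.group(1)
    elif "Reference OCR token: " in prompt and len(lines) == 3:
        # Format 2 (assumed to be three lines): the OCR line is either first or second.
        question = lines[1] if prompt.startswith("Reference OCR token:") else lines[0]
    elif len(lines) == 2:
        # Format 3: question on the first line, instruction on the second.
        question = lines[0]
    else:
        raise ValueError("unrecognized prompt format")
    return question.lower()
```

The lowercased question is useful as a lookup key when pairing a model answer with its annotation.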

CLI usage:

python -m llava.eval.eval_textvqa \
    --annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
    --result-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl

Argument           Type  Description
--annotation-file  str   TextVQA annotation JSON with ground truth answers
--result-file      str   Single model answer JSONL file
--result-dir       str   (Alternative) Directory of JSONL files to evaluate all at once
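The 10-answer voting used for VQA accuracy can be illustrated with the sketch below. It assumes the common VQA soft-accuracy scheme (each candidate is scored against the other nine human answers, capped at three matches, then averaged) and that answers are already normalized. The function names are hypothetical and this is not the TextVQAAccuracyEvaluator code itself.

```python
def answer_scores(gt_answers: list) -> dict:
    """Score every distinct ground-truth answer by leave-one-out agreement."""
    if len(gt_answers) != 10:
        raise ValueError("expected exactly 10 ground-truth answers")
    indexed = list(enumerate(gt_answers))
    scores = {}
    for candidate in set(gt_answers):
        accs = []
        for held_out in indexed:
            # Compare the candidate against the nine answers other than held_out.
            others = [item for item in indexed if item != held_out]
            matches = sum(1 for _, ans in others if ans == candidate)
            accs.append(min(1.0, matches / 3))
        scores[candidate] = sum(accs) / len(accs)
    return scores


def vqa_accuracy(prediction: str, gt_answers: list) -> float:
    """Soft accuracy of one prediction; answers nobody gave score 0."""
    return answer_scores(gt_answers).get(prediction, 0.0)
```

Under this scheme a prediction that matches at least four of the ten annotators scores 1.0, while one that matches a single annotator still earns partial credit.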

eval_science_qa.py

# Main block (no function wrapper)
# CLI: python llava/eval/eval_science_qa.py --base-dir ... --result-file ... --output-file ... --output-result ...

def get_pred_idx(prediction: str, choices: list, options: list) -> int:
    """
    Convert predicted option letter to index.

    Args:
        prediction: Predicted letter (e.g., 'C')
        choices: List of answer choice texts
        options: List of valid option letters ['A', 'B', 'C', 'D', 'E']

    Returns:
        Index of the predicted option, or -1 if invalid
    """
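A sketch of the letter extraction and index lookup is given below. The specific patterns in extract_option_letter and the "FAILED" sentinel are assumptions chosen to illustrate "multiple heuristic patterns". Only the documented contract of get_pred_idx (return the index, or -1 if invalid) is taken from the description above.

```python
import re

OPTIONS = ["A", "B", "C", "D", "E"]


def extract_option_letter(text: str, options: list = OPTIONS) -> str:
    """Pull an option letter out of raw model text (hypothetical heuristics)."""
    if text in options:
        return text  # the output is just the letter, e.g. "B"
    if len(text) >= 3 and text[0] in options and text[1:3] == ". ":
        return text[0]  # "C. some choice text"
    found = re.findall(r"The answer is ([A-Z])\.", text)
    if len(found) == 1:
        return found[0]  # "... The answer is A."
    return "FAILED"  # sentinel that never matches a valid option


def get_pred_idx(prediction: str, choices: list, options: list) -> int:
    """Convert a letter to its index, or -1 if it is not valid for this question."""
    if prediction in options[:len(choices)]:
        return options.index(prediction)
    return -1
```

Restricting the lookup to options[:len(choices)] means a letter such as 'D' is rejected for a three-choice question.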

CLI usage:

python llava/eval/eval_science_qa.py \
    --base-dir ./playground/data/eval/scienceqa \
    --result-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b.jsonl \
    --output-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_output.json \
    --output-result ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_result.json \
    --split test

Argument         Type  Default     Description
--base-dir       str   (required)  Directory containing pid_splits.json and problems.json
--result-file    str   (required)  Model answer JSONL file
--output-file    str   (required)  Path to write detailed analysis JSON (correct/incorrect lists)
--output-result  str   (required)  Path to write summary results JSON (acc, correct, count)
--split          str   test        Dataset split to evaluate (matches keys in pid_splits.json)

Inputs

  • Model answer JSONL - Output from model_vqa_loader.py with question_id, prompt, text fields
  • Ground-truth annotation files - Benchmark-specific formats:
    • POPE: JSONL with label field (yes/no)
    • TextVQA: JSON with data array containing image_id, question, answers (list of 10)
    • ScienceQA: problems.json (question data) + pid_splits.json (train/val/test splits)
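Reading a model answer JSONL file can be sketched as follows. The helper load_jsonl and the sample records are hypothetical; only the field names question_id, prompt, and text come from the description above.

```python
import json
import os
import tempfile


def load_jsonl(path: str) -> list:
    """Read one JSON object per non-empty line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round-trip two hypothetical answer records through a temporary file.
sample = [
    {"question_id": 1, "prompt": "Is there a dog in the image?", "text": "Yes"},
    {"question_id": 2, "prompt": "Is there a cat in the image?", "text": "No"},
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "answers.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in sample:
            f.write(json.dumps(record) + "\n")
    records = load_jsonl(path)

# Index by question_id so answers can be matched to annotations.
by_id = {r["question_id"]: r for r in records}
```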

Outputs

POPE

Printed to stdout per category:

Category: adversarial, # samples: 3000
TP      FP      TN      FN
1423    77      1423    77
Accuracy: 0.9487
Precision: 0.9487
Recall: 0.9487
F1 score: 0.9487
Yes ratio: 0.5
====================================

TextVQA

llava-v1.5-13b
Samples: 5000
Accuracy: 58.21%

ScienceQA

Printed to stdout and written to JSON files:

Total: 4241, Correct: 2973, Accuracy: 70.10%, IMG-Accuracy: 68.45%

The output-result JSON contains:

{
    "acc": 70.10,
    "correct": 2973,
    "count": 4241,
    "results": {"problem_id": predicted_index, ...},
    "outputs": {"problem_id": "raw_model_text", ...}
}
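How such a summary could be assembled is sketched below. The function summarize and the answers mapping (ground-truth index per problem id) are hypothetical, and acc is assumed to be a percentage, consistent with the sample above.

```python
import json


def summarize(results: dict, outputs: dict, answers: dict) -> dict:
    """Build the summary dict from predicted indices and ground-truth indices."""
    count = len(results)
    # A prediction of -1 (invalid letter) can never equal a ground-truth index.
    correct = sum(1 for pid, idx in results.items() if idx == answers[pid])
    return {
        "acc": correct / count * 100 if count else 0.0,
        "correct": correct,
        "count": count,
        "results": results,
        "outputs": outputs,
    }


summary = summarize(
    results={"1": 0, "2": 1, "3": -1, "4": 2},
    outputs={"1": "A", "2": "B", "3": "unsure", "4": "C"},
    answers={"1": 0, "2": 1, "3": 0, "4": 1},
)
summary_json = json.dumps(summary, indent=4)
```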

Metadata

Property      Value
last_updated  2026-02-13 14:00 GMT
page_type     Implementation (API Doc)
workflow      Benchmark_Evaluation
source_files  llava/eval/eval_pope.py, llava/eval/eval_textvqa.py, llava/eval/eval_science_qa.py
