Implementation:Haotian liu LLaVA Eval Metrics Suite
Overview
Collection of evaluation metric scripts for computing accuracy across POPE, TextVQA, and ScienceQA benchmarks. Each script loads model answers from JSONL output files and ground truth from benchmark-specific annotation files.
Description
Three evaluation scripts form the local metric computation suite:
- eval_pope.py - Computes binary classification metrics (accuracy, precision, recall, F1, yes-ratio) for hallucination detection. Normalizes model outputs to binary yes/no by checking for negation words. Evaluates per POPE category (adversarial, popular, random) by filtering answers based on question category.
- eval_textvqa.py - Uses TextVQAAccuracyEvaluator from m4c_evaluator.py for VQA accuracy with answer normalization and 10-answer voting. Includes a prompt_processor() function that extracts the core question text from various OCR-augmented prompt formats.
- eval_science_qa.py - Computes exact-match accuracy on multiple-choice ScienceQA questions with per-category breakdowns. Parses model text output to extract option letters using multiple heuristic patterns. Outputs both a detailed analysis JSON and a summary results JSON.
Sources
- llava/eval/eval_pope.py:L5-81
- llava/eval/eval_textvqa.py:L35-51 (eval_single), L17-32 (prompt_processor)
- llava/eval/eval_science_qa.py:L39-114
- llava/eval/m4c_evaluator.py:L221-257 (TextVQAAccuracyEvaluator)
API Signatures
eval_pope.py
def eval_pope(answers: list, label_file: str) -> None:
"""
Evaluate POPE binary classification metrics for a single category.
Args:
answers: List of dicts with 'text' field containing model predictions.
Text is normalized: if its words include 'No', 'not', or 'no' -> 'no', else -> 'yes'.
label_file: Path to POPE ground truth JSONL file (one JSON object per line, each with a 'label' field).
Prints:
TP, FP, TN, FN counts
Accuracy, Precision, Recall, F1 score, Yes ratio
Summary line: "F1, accuracy, precision, recall, yes_ratio"
"""
CLI usage:
python llava/eval/eval_pope.py \
--annotation-dir ./playground/data/eval/pope/coco \
--question-file ./playground/data/eval/pope/llava_pope_test.jsonl \
--result-file ./playground/data/eval/pope/answers/llava-v1.5-13b.jsonl
| Argument | Type | Description |
|---|---|---|
| --annotation-dir | str | Directory containing coco_pope_*.json label files |
| --question-file | str | Question JSONL with question_id and category fields |
| --result-file | str | Model answer JSONL from model_vqa_loader.py |
The main block iterates over all files matching coco_pope_*.json in the annotation directory, extracts the category name from the filename, filters answers by category, and calls eval_pope() for each.
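The normalization and metric computation described above can be sketched as follows. This is a minimal illustration, not the repository code: the helper names normalize_answer, category_from_filename, and binary_metrics are hypothetical, the first-sentence truncation and comma stripping are assumptions about how the text is tokenized, and the zero-division guards are an addition for safety.

```python
from typing import Dict, List


def normalize_answer(text: str) -> str:
    """Map free-form model text to 'yes' or 'no' (assumed rule, see lead-in)."""
    # Assumption: only the first sentence is inspected and commas are dropped.
    first_sentence = text.split(".")[0].replace(",", "")
    words = first_sentence.split(" ")
    if "No" in words or "not" in words or "no" in words:
        return "no"
    return "yes"


def category_from_filename(filename: str) -> str:
    """Extract e.g. 'adversarial' from 'coco_pope_adversarial.json'."""
    prefix, suffix = "coco_pope_", ".json"
    return filename[len(prefix):-len(suffix)]


def binary_metrics(preds: List[str], labels: List[str]) -> Dict[str, float]:
    """Confusion counts and derived metrics, treating 'yes' as the positive class."""
    pairs = list(zip(preds, labels))
    tp = sum(p == "yes" and l == "yes" for p, l in pairs)
    fp = sum(p == "yes" and l == "no" for p, l in pairs)
    tn = sum(p == "no" and l == "no" for p, l in pairs)
    fn = sum(p == "no" and l == "yes" for p, l in pairs)

    # Guards against empty denominators are an addition in this sketch.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    total = tp + fp + tn + fn
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }
```

Because the rule checks for whole words, an answer such as "There is nothing" would be normalized to 'yes' under this sketch.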
eval_textvqa.py
def eval_single(annotation_file: str, result_file: str) -> None:
"""
Evaluate TextVQA accuracy for a single result file.
Args:
annotation_file: Path to TextVQA annotation JSON (with 'data' key containing
list of annotations with 'image_id', 'question', 'answers')
result_file: Path to model answer JSONL with 'question_id', 'prompt', 'text'
Prints:
Experiment name, sample count, accuracy percentage
"""
def prompt_processor(prompt: str) -> str:
"""
Extract core question text from OCR-augmented prompts.
Handles three prompt formats:
1. 'OCR tokens: ... Question: <q> Short answer:' format
2. 'Reference OCR token: ...' multi-line format
3. Simple two-line format (question + instruction)
Returns lowercased question text.
"""
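The three prompt formats listed in the docstring could be handled roughly as shown below. This is a sketch under stated assumptions rather than the repository implementation: the function name extract_question is hypothetical, the "Reference OCR token" format is assumed to span exactly three lines, and unrecognized prompts raise ValueError here.

```python
import re


def extract_question(prompt: str) -> str:
    """Return the lowercased question text from an OCR-augmented prompt."""
    lines = prompt.split("\n")
    if prompt.startswith("OCR tokens: "):
        # Format 1: 'OCR tokens: ... Question: <q> Short answer:'
        match = re.search(r"Question: (.*?) Short answer:", prompt, re.DOTALL)
        if match is None:
            raise ValueError("no 'Question: ... Short answer:' span found")
        question = match.group(1)
    elif "Reference OCR token: " in prompt and len(lines) == 3:
        # Format 2 (assumed three lines): the OCR line may come first or second.
        question = lines[1] if prompt.startswith("Reference OCR token:") else lines[0]
    elif len(lines) == 2:
        # Format 3: question on the first line, instruction on the second.
        question = lines[0]
    else:
        raise ValueError("unrecognized prompt format")
    return question.lower()
```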
CLI usage:
python -m llava.eval.eval_textvqa \
--annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
--result-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl
| Argument | Type | Description |
|---|---|---|
| --annotation-file | str | TextVQA annotation JSON with ground truth answers |
| --result-file | str | Single model answer JSONL file |
| --result-dir | str | (Alternative) Directory of JSONL files to evaluate all at once |
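To illustrate what "10-answer voting" typically means, here is a sketch of the usual VQA soft-accuracy rule: each human answer is held out in turn, and the prediction scores min(1, matches / 3) against the remaining answers. The function name vqa_soft_score is hypothetical, and the normalization is reduced to lowercasing and stripping whitespace; TextVQAAccuracyEvaluator applies a fuller answer-normalization step that is not reproduced here.

```python
from typing import List


def vqa_soft_score(prediction: str, gt_answers: List[str]) -> float:
    """Leave-one-out soft accuracy of one prediction against human answers."""
    if not gt_answers:
        raise ValueError("gt_answers must not be empty")
    # Simplified normalization (assumption): lowercase and strip only.
    pred = prediction.strip().lower()
    answers = [a.strip().lower() for a in gt_answers]

    accs = []
    for held_out in range(len(answers)):
        # Compare against every annotator answer except the held-out one.
        others = answers[:held_out] + answers[held_out + 1:]
        matches = sum(a == pred for a in others)
        accs.append(min(1.0, matches / 3))
    return sum(accs) / len(accs)
```

Under this rule a prediction matching three of ten annotators scores 0.9 rather than 1.0, because the three rounds that hold out a matching annotator see only two matches.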
eval_science_qa.py
# Main block (no function wrapper)
# CLI: python llava/eval/eval_science_qa.py --base-dir ... --result-file ... --output-file ... --output-result ...
def get_pred_idx(prediction: str, choices: list, options: list) -> int:
"""
Convert predicted option letter to index.
Args:
prediction: Predicted letter (e.g., 'C')
choices: List of answer choice texts
options: List of valid option letters ['A', 'B', 'C', 'D', 'E']
Returns:
Index of the predicted option, or -1 if invalid
"""
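A sketch of how raw model text might be reduced to an option letter and then to an index follows. The helper extract_option_letter and its three patterns (a bare letter, a leading "X. " prefix, and a single "The answer is X." match) are assumptions chosen to illustrate the kind of heuristics involved, as is the "FAILED" sentinel. get_pred_idx follows the documented contract, with the added assumption that a letter beyond the number of available choices counts as invalid.

```python
import re
from typing import List

OPTIONS = ["A", "B", "C", "D", "E"]


def extract_option_letter(pred_text: str, options: List[str]) -> str:
    """Pull an option letter out of model text, or return 'FAILED'."""
    if pred_text in options:
        # The model answered with a bare letter, e.g. 'B'.
        return pred_text
    if len(pred_text) >= 3 and pred_text[0] in options and pred_text[1:3] == ". ":
        # The model answered like 'B. because ...'.
        return pred_text[0]
    # Fall back to a phrase match; accept it only if it is unambiguous.
    matches = re.findall(r"The answer is ([A-Z])\.", pred_text)
    if len(matches) == 1:
        return matches[0]
    return "FAILED"


def get_pred_idx(prediction: str, choices: List[str], options: List[str]) -> int:
    """Convert a predicted letter to its index, or -1 if it is not valid."""
    if prediction in options[:len(choices)]:
        return options.index(prediction)
    return -1
```

For example, extract_option_letter("The answer is C.", OPTIONS) yields "C", which get_pred_idx maps to 2 for a three-choice question.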
CLI usage:
python llava/eval/eval_science_qa.py \
--base-dir ./playground/data/eval/scienceqa \
--result-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b.jsonl \
--output-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_output.json \
--output-result ./playground/data/eval/scienceqa/answers/llava-v1.5-13b_result.json \
--split test
| Argument | Type | Default | Description |
|---|---|---|---|
| --base-dir | str | (required) | Directory containing pid_splits.json and problems.json |
| --result-file | str | (required) | Model answer JSONL file |
| --output-file | str | (required) | Path to write detailed analysis JSON (correct/incorrect lists) |
| --output-result | str | (required) | Path to write summary results JSON (acc, correct, count) |
| --split | str | test | Dataset split to evaluate (matches keys in pid_splits.json) |
Inputs
- Model answer JSONL - Output from model_vqa_loader.py with question_id, prompt, text fields
- Ground-truth annotation files - Benchmark-specific formats:
  - POPE: JSONL with label field (yes/no)
  - TextVQA: JSON with data array containing image_id, question, answers (list of 10)
  - ScienceQA: problems.json (question data) + pid_splits.json (train/val/test splits)
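Since every script starts by reading a model answer JSONL, a small loader may help when inspecting these files by hand. The helpers load_jsonl and index_by_question_id are hypothetical and not part of the scripts; the sketch assumes only that each line is a JSON object carrying a question_id field, as listed above.

```python
import json
from typing import Dict, List


def load_jsonl(path: str) -> List[dict]:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def index_by_question_id(records: List[dict]) -> Dict[object, dict]:
    """Build a lookup from question_id to its answer record."""
    return {record["question_id"]: record for record in records}
```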
Outputs
POPE
Printed to stdout per category:
Category: adversarial, # samples: 3000
TP FP TN FN
1423 77 1423 77
Accuracy: 0.9487
Precision: 0.9487
Recall: 0.9487
F1 score: 0.9487
Yes ratio: 0.5
====================================
TextVQA
llava-v1.5-13b
Samples: 5000
Accuracy: 58.21%
ScienceQA
Printed to stdout and written to JSON files:
Total: 4241, Correct: 2973, Accuracy: 70.10%, IMG-Accuracy: 68.45%
The output-result JSON contains:
{
"acc": 70.10,
"correct": 2973,
"count": 4241,
"results": {"problem_id": predicted_index, ...},
"outputs": {"problem_id": "raw_model_text", ...}
}
Metadata
| Property | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| page_type | Implementation (API Doc) |
| workflow | Benchmark_Evaluation |
| source_files | llava/eval/eval_pope.py, llava/eval/eval_textvqa.py, llava/eval/eval_science_qa.py |