Principle:Haotian liu LLaVA Benchmark Metric Computation

Overview

Automated computation of evaluation metrics (accuracy, F1, precision, recall) for multimodal benchmark results, enabling quantitative assessment of LLaVA model performance.

Description

LLaVA implements metric computation for several benchmarks that can be evaluated locally without requiring online server submission. Each metric script loads model answers and ground truth annotations, normalizes answers where appropriate, and computes standard classification or accuracy metrics.

The three primary metric computation pipelines are:

POPE Hallucination Evaluation

POPE (Polling-based Object Probing Evaluation) evaluates object hallucination in vision-language models. The evaluation computes binary classification metrics by comparing model yes/no predictions against ground truth labels. Model outputs are first normalized: text containing "No", "not", or "no" is mapped to no; all other responses are mapped to yes. Metrics are computed per POPE category (adversarial, popular, random). The categories differ in how the absent objects asked about are sampled: at random, from the most frequent objects in the dataset, or from objects that frequently co-occur with those actually present in the image.
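The normalization rule can be sketched as follows. This is a minimal illustration, not the script itself: the function name is hypothetical, and it assumes that matching is done on whole words of the first sentence after commas are removed.

```python
def normalize_pope_answer(text: str) -> str:
    """Map a free-form model response to 'yes' or 'no'."""
    # Assumption: only the first sentence is inspected.
    first_sentence = text.split(".")[0]
    # Assumption: commas are dropped and the text is split on single spaces,
    # so matching is on whole words rather than substrings.
    words = first_sentence.replace(",", "").split(" ")
    if "No" in words or "not" in words or "no" in words:
        return "no"
    # Anything without an explicit negation word counts as "yes".
    return "yes"


if __name__ == "__main__":
    for reply in ["No, there is no dog.", "Yes, there is a dog.", "I cannot tell"]:
        print(repr(reply), "->", normalize_pope_answer(reply))
```

Under this rule, "yes" is the default: a response with no negation word, including an evasive or unrelated one, is counted as "yes".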

TextVQA Accuracy

TextVQA evaluation uses the M4C TextVQAAccuracyEvaluator, which implements the standard VQA accuracy protocol with 10-answer voting. For each question, 10 human reference answers are provided. The accuracy for a predicted answer is computed as min(count_of_matching_humans / 3, 1.0), averaged across all leave-one-out combinations. Both predicted and reference answers are normalized via EvalAIAnswerProcessor before comparison.

ScienceQA Accuracy

ScienceQA evaluation computes exact match accuracy on multiple-choice questions. The model output is parsed to extract the predicted option letter (A-E) using multiple heuristics: direct letter match, "A. " prefix format, and regex pattern "The answer is X." extraction. Results include overall accuracy and an image-question accuracy (IMG-Accuracy) computed only on questions that contain image context.

Usage

Run these metric scripts after batch inference to compute quantitative metrics for POPE, TextVQA, and ScienceQA benchmarks. These three benchmarks support local evaluation; other benchmarks (VQAv2, MMBench, VizWiz) require online server submission and do not have local metric scripts.

Metric scripts are typically invoked at the end of evaluation shell scripts:

# POPE (from pope.sh)
python llava/eval/eval_pope.py \
    --annotation-dir ./playground/data/eval/pope/coco \
    --question-file ./playground/data/eval/pope/llava_pope_test.jsonl \
    --result-file ./playground/data/eval/pope/answers/llava-v1.5-13b.jsonl

# TextVQA (from textvqa.sh)
python -m llava.eval.eval_textvqa \
    --annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
    --result-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl

# ScienceQA (from sqa.sh)
python llava/eval/eval_science_qa.py \
    --base-dir ./playground/data/eval/scienceqa \
    --result-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b.jsonl \
    --output-file ./playground/data/eval/scienceqa/answers/output.json \
    --output-result ./playground/data/eval/scienceqa/answers/result.json

Theoretical Basis

POPE: Binary Classification Metrics

Metric	Formula	Description
Accuracy	`(TP + TN) / (TP + TN + FP + FN)`	Overall correctness of yes/no predictions
Precision	`TP / (TP + FP)`	Proportion of "yes" predictions that are correct
Recall	`TP / (TP + FN)`	Proportion of actual "yes" labels correctly predicted
F1 Score	`2 * precision * recall / (precision + recall)`	Harmonic mean of precision and recall
Yes Ratio	`count(pred=yes) / total`	Proportion of "yes" predictions (bias indicator)

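Here TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives, with "yes" as the positive class (pos=1) and "no" as the negative class (neg=0). Because POPE question sets are balanced between "yes" and "no" ground truth answers, a yes ratio near 0.5 indicates unbiased predictions; a ratio well above 0.5 indicates a tendency to over-affirm object presence, which is the hallucination behavior POPE is designed to expose.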
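The formulas in the table can be computed directly from the confusion counts. The sketch below assumes predictions and labels are already encoded as 1 (yes) and 0 (no); the function name and the zero-denominator guards are illustrative additions, not taken from the LLaVA script.

```python
def pope_metrics(preds, labels):
    """Compute POPE-style binary metrics from 0/1 predictions and labels."""
    pairs = list(zip(preds, labels))
    tp = sum(1 for p, l in pairs if p == 1 and l == 1)
    fp = sum(1 for p, l in pairs if p == 1 and l == 0)
    tn = sum(1 for p, l in pairs if p == 0 and l == 0)
    fn = sum(1 for p, l in pairs if p == 0 and l == 1)

    total = tp + fp + tn + fn
    # Guards against empty denominators are an assumption of this sketch.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": (tp + fp) / total if total else 0.0,
    }


if __name__ == "__main__":
    print(pope_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
```

In the example call there are 2 true positives, 1 false positive, 2 true negatives, and 1 false negative, so accuracy is 4/6 and the yes ratio is 3/6.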

TextVQA: VQA Accuracy with 10-Answer Voting

The VQA accuracy protocol uses 10 human reference answers per question:

1. Normalize the predicted answer and all 10 reference answers via EvalAIAnswerProcessor.
2. For each unique reference answer, apply leave-one-out: for each of the 10 ground truth positions, hold that reference out and count how many of the other 9 match the candidate answer.
3. Compute min(matching_count / 3, 1.0) for each held-out position and average over all 10 positions to get the candidate's score. The predicted answer receives the score of the reference answer it matches.
4. Final accuracy = mean score across all questions.
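The leave-one-out scoring can be sketched as follows. This is an illustrative sketch with hypothetical function names; it assumes the answers passed in are already normalized (EvalAIAnswerProcessor is not reproduced here) and that a prediction matching none of the references scores 0.

```python
def reference_scores(reference_answers):
    """Leave-one-out VQA score for each unique (already normalized) reference."""
    assert len(reference_answers) == 10, "protocol expects 10 reference answers"
    indexed = list(enumerate(reference_answers))
    scores = {}
    for candidate in set(reference_answers):
        per_position = []
        for held_out in indexed:
            # Drop one reference, then count matches among the remaining nine.
            others = [item for item in indexed if item != held_out]
            matches = sum(1 for _, answer in others if answer == candidate)
            per_position.append(min(1.0, matches / 3))
        scores[candidate] = sum(per_position) / len(per_position)
    return scores


def vqa_score(predicted, reference_answers):
    """Score one prediction; assumed to be 0.0 if it matches no reference."""
    return reference_scores(reference_answers).get(predicted, 0.0)


if __name__ == "__main__":
    refs = ["stop"] * 2 + ["go"] * 8
    for answer in ["stop", "go", "yield"]:
        print(answer, round(vqa_score(answer, refs), 3))
```

As a worked case, an answer given by 2 of the 10 annotators gets 1/3 at the two positions where one of its own copies is held out and 2/3 at the other eight, which averages to 0.6. An answer given by 4 or more annotators always scores 1.0.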

ScienceQA: Exact Match with Category Breakdown

Multiple-choice answer extraction follows a priority chain:

1. Direct option letter match (e.g., "A")
2. Option prefix format (e.g., "A. ")
3. Regex extraction from natural language (e.g., "The answer is A.")
4. Fallback to "FAILED" if no pattern matches

Accuracy is computed as exact match between predicted and ground truth option indices, with separate reporting for image-containing questions (IMG-Accuracy).
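The priority chain and the two accuracy figures can be sketched as below. The function names and the (text, ground truth letter, has_image) record layout are hypothetical, and the sketch compares option letters rather than indices for brevity; it also assumes the regex step is accepted only when exactly one match is found.

```python
import re

OPTIONS = ["A", "B", "C", "D", "E"]
ANSWER_PATTERN = re.compile(r"The answer is ([A-Z])\.")


def extract_option(pred_text, options=OPTIONS):
    """Return the predicted option letter, or 'FAILED' if none is found."""
    # 1. The whole output is a bare option letter.
    if pred_text in options:
        return pred_text
    # 2. The output starts with "<letter>. ".
    if len(pred_text) >= 3 and pred_text[0] in options and pred_text[1:3] == ". ":
        return pred_text[0]
    # 3. Natural-language form; assumed valid only with a single match.
    found = ANSWER_PATTERN.findall(pred_text)
    if len(found) == 1:
        return found[0]
    # 4. Nothing matched.
    return "FAILED"


def scienceqa_accuracy(records):
    """records: iterable of (pred_text, gt_letter, has_image) tuples."""
    records = list(records)
    hits = [extract_option(text) == gt for text, gt, _ in records]
    img_hits = [hit for hit, (_, _, has_image) in zip(hits, records) if has_image]
    overall = sum(hits) / len(hits) if hits else 0.0
    img = sum(img_hits) / len(img_hits) if img_hits else 0.0
    return overall, img


if __name__ == "__main__":
    sample = [
        ("A", "A", True),
        ("B. Because plants need light", "B", False),
        ("The answer is C.", "D", True),
        ("I think it is A", "A", False),
        ("The answer is E.", "E", True),
    ]
    print(scienceqa_accuracy(sample))
```

A "FAILED" extraction never equals a ground truth letter, so an unparseable output is simply counted as incorrect in both figures.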

Knowledge Sources

Paper - POPE - Evaluating Object Hallucination in Large Vision-Language Models
Repo - LLaVA - https://github.com/haotian-liu/LLaVA

Domains

Evaluation
Metrics

Related Pages

Implementation:Haotian_liu_LLaVA_Eval_Metrics_Suite

Metadata

Property	Value
last_updated	2026-02-13 14:00 GMT
page_type	Principle
workflow	Benchmark_Evaluation
