Principle:Haotian liu LLaVA Benchmark Metric Computation
Overview
Automated computation of evaluation metrics (accuracy, F1, precision, recall) for multimodal benchmark results, enabling quantitative assessment of LLaVA model performance.
Description
LLaVA implements metric computation for several benchmarks that can be evaluated locally without requiring online server submission. Each metric script loads model answers and ground truth annotations, normalizes answers where appropriate, and computes standard classification or accuracy metrics.
The three primary metric computation pipelines are:
POPE Hallucination Evaluation
POPE (Polling-based Object Probing Evaluation) evaluates object hallucination in vision-language models. The evaluation computes binary classification metrics by comparing model yes/no predictions against ground truth labels. Model outputs are first normalized: text containing "No", "not", or "no" is mapped to no; all other responses are mapped to yes. Metrics are computed per POPE category (adversarial, popular, random), with each category testing different types of hallucination.
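The normalization rule above can be sketched in a few lines. This is a minimal illustration, not the code from `eval_pope.py`: the function name `normalize_pope_answer` is hypothetical, and matching on whole words (after stripping commas and periods) is an assumption about how "containing" is interpreted.

```python
def normalize_pope_answer(text: str) -> str:
    """Map a free-form model response to 'yes' or 'no' (hypothetical helper)."""
    # Assumption: compare whole words, with commas and periods removed,
    # so that words such as "know" or "nothing" do not count as "no".
    words = text.replace(",", " ").replace(".", " ").split()
    if any(word in ("No", "not", "no") for word in words):
        return "no"
    # Every other response is treated as "yes".
    return "yes"


print(normalize_pope_answer("No, there is no dog."))   # no
print(normalize_pope_answer("Yes, there is a dog."))   # yes
```

Note that the default branch maps any response without a negation word to yes, so an off-topic or empty answer also counts as a "yes" prediction.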
TextVQA Accuracy
TextVQA evaluation uses the M4C TextVQAAccuracyEvaluator, which implements the standard VQA accuracy protocol with 10-answer voting. For each question, 10 human reference answers are provided. The accuracy for a predicted answer is computed as min(count_of_matching_humans / 3, 1.0), averaged across all leave-one-out combinations. Both predicted and reference answers are normalized via EvalAIAnswerProcessor before comparison.
ScienceQA Accuracy
ScienceQA evaluation computes exact-match accuracy on multiple-choice questions. The model output is parsed to extract the predicted option letter (A-E) using several heuristics, tried in order: a direct letter match, the "A. " prefix format, and regex extraction from the pattern "The answer is X.". Results include overall accuracy and an image-question accuracy (IMG-Accuracy) computed only on questions that contain image context.
Usage
Run these metric scripts after batch inference to compute quantitative metrics for POPE, TextVQA, and ScienceQA benchmarks. These three benchmarks support local evaluation; other benchmarks (VQAv2, MMBench, VizWiz) require online server submission and do not have local metric scripts.
Metric scripts are typically invoked at the end of evaluation shell scripts:
# POPE (from pope.sh)
python llava/eval/eval_pope.py \
--annotation-dir ./playground/data/eval/pope/coco \
--question-file ./playground/data/eval/pope/llava_pope_test.jsonl \
--result-file ./playground/data/eval/pope/answers/llava-v1.5-13b.jsonl
# TextVQA (from textvqa.sh)
python -m llava.eval.eval_textvqa \
--annotation-file ./playground/data/eval/textvqa/TextVQA_0.5.1_val.json \
--result-file ./playground/data/eval/textvqa/answers/llava-v1.5-13b.jsonl
# ScienceQA (from sqa.sh)
python llava/eval/eval_science_qa.py \
--base-dir ./playground/data/eval/scienceqa \
--result-file ./playground/data/eval/scienceqa/answers/llava-v1.5-13b.jsonl \
--output-file ./playground/data/eval/scienceqa/answers/output.json \
--output-result ./playground/data/eval/scienceqa/answers/result.json
Theoretical Basis
POPE: Binary Classification Metrics
| Metric | Formula | Description |
|---|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness of yes/no predictions |
| Precision | TP / (TP + FP) | Proportion of "yes" predictions that are correct |
| Recall | TP / (TP + FN) | Proportion of actual "yes" labels correctly predicted |
| F1 Score | 2 * precision * recall / (precision + recall) | Harmonic mean of precision and recall |
| Yes Ratio | count(pred=yes) / total | Proportion of "yes" predictions (bias indicator) |
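Here "yes" is the positive class (pos=1) and "no" is the negative class (neg=0), so TP counts correct "yes" predictions and TN counts correct "no" predictions. A yes ratio near 0.5 indicates unbiased predictions; a ratio well above 0.5 means the model answers "yes" too readily, which suggests a tendency to hallucinate objects.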
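The formulas in the table can be checked with a small worked example. The sketch below is illustrative only: the function name `pope_metrics` and the zero-division guards are assumptions, not taken from the LLaVA script.

```python
def pope_metrics(preds, labels):
    """Compute POPE-style binary metrics from lists of 1 (yes) / 0 (no)."""
    tp = fp = tn = fn = 0
    for pred, label in zip(preds, labels):
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    # Assumption: return 0.0 instead of raising when a denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    yes_ratio = (tp + fp) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "yes_ratio": yes_ratio}


# Six questions: TP=2, FP=1, TN=2, FN=1.
metrics = pope_metrics(preds=[1, 1, 0, 1, 0, 0], labels=[1, 0, 0, 1, 1, 0])
print(metrics)
```

In this example accuracy is 4/6, precision and recall are both 2/3, and the yes ratio is exactly 0.5 because three of the six predictions are "yes".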
TextVQA: VQA Accuracy with 10-Answer Voting
The VQA accuracy protocol uses 10 human reference answers per question:
- Normalize both the predicted answer and all 10 reference answers via EvalAIAnswerProcessor
- For each unique answer, compute accuracy using leave-one-out: for each of the 10 ground truth positions, count how many of the other 9 match the candidate answer
- Score = min(matching_count / 3, 1.0), averaged across all 10 positions
- Final accuracy = mean score across all questions
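The leave-one-out scoring described above can be sketched as follows. This is a simplified illustration rather than the M4C evaluator itself: `vqa_score` and `mean_vqa_accuracy` are hypothetical names, and `normalize` is a lowercase-and-strip stand-in for `EvalAIAnswerProcessor`, whose real normalization is more involved.

```python
def normalize(answer: str) -> str:
    # Stand-in for EvalAIAnswerProcessor (assumption: lowercase + strip only).
    return answer.lower().strip()


def vqa_score(pred, refs):
    """Leave-one-out VQA accuracy of one prediction against its references."""
    pred = normalize(pred)
    refs = [normalize(r) for r in refs]
    scores = []
    for i in range(len(refs)):
        # Hold out reference i and count matches among the remaining ones.
        others = refs[:i] + refs[i + 1:]
        matching_count = sum(1 for r in others if r == pred)
        scores.append(min(matching_count / 3, 1.0))
    return sum(scores) / len(scores)


def mean_vqa_accuracy(examples):
    """Average the per-question score over (prediction, references) pairs."""
    return sum(vqa_score(p, r) for p, r in examples) / len(examples)


refs = ["stop"] * 2 + ["go"] * 8
print(vqa_score("go", refs))     # agrees with 8 of 10 annotators
print(vqa_score("stop", refs))   # agrees with 2 of 10 annotators
print(vqa_score("yield", refs))  # agrees with none
```

With two of ten annotators answering "stop", the prediction "stop" scores 0.6: the two positions that hold out a "stop" see one match (1/3 each), and the other eight see two matches (2/3 each). The prediction "go" scores 1.0 and "yield" scores 0.0.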
ScienceQA: Exact Match with Category Breakdown
Multiple-choice answer extraction follows a priority chain:
- Direct option letter match (e.g., "A")
- Option prefix format (e.g., "A. ")
- Regex extraction from natural language (e.g., "The answer is A.")
- Fallback to "FAILED" if no pattern matches
Accuracy is computed as exact match between predicted and ground truth option indices, with separate reporting for image-containing questions (IMG-Accuracy).
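The priority chain and the two accuracy figures can be sketched as below. This is a hedged illustration, not the contents of `eval_science_qa.py`: the names `extract_option`, `option_index`, and `accuracy`, the record keys `output`, `answer`, and `has_image`, the exact regex, and the choice to take the first regex match are all assumptions made for the example.

```python
import re

OPTIONS = ("A", "B", "C", "D", "E")


def extract_option(text: str) -> str:
    """Return the predicted option letter, or 'FAILED' if none is found."""
    text = text.strip()
    # 1. Direct option letter match, e.g. "A".
    if text in OPTIONS:
        return text
    # 2. Option prefix format, e.g. "A. Some explanation".
    if len(text) >= 3 and text[0] in OPTIONS and text[1:3] == ". ":
        return text[0]
    # 3. Regex extraction from natural language (assumed pattern).
    match = re.search(r"The answer is ([A-E])\.", text)
    if match:
        return match.group(1)
    # 4. Fallback when no pattern matches.
    return "FAILED"


def option_index(letter: str) -> int:
    # "FAILED" maps to -1, which never equals a ground-truth index.
    return OPTIONS.index(letter) if letter in OPTIONS else -1


def accuracy(records) -> float:
    """Exact-match accuracy between predicted and ground-truth indices."""
    if not records:
        return 0.0
    correct = sum(
        1 for r in records
        if option_index(extract_option(r["output"])) == r["answer"]
    )
    return correct / len(records)


# Hypothetical records; "answer" is the ground-truth option index.
records = [
    {"output": "B", "answer": 1, "has_image": True},
    {"output": "C. The mantle", "answer": 2, "has_image": False},
    {"output": "The answer is A.", "answer": 0, "has_image": True},
    {"output": "I am not sure", "answer": 0, "has_image": True},
]
print("Accuracy:", accuracy(records))
print("IMG-Accuracy:", accuracy([r for r in records if r["has_image"]]))
```

Mapping a failed extraction to index -1 means unparseable outputs are counted as wrong answers rather than being dropped from the denominator.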
Knowledge Sources
- Paper - POPE - Evaluating Object Hallucination in Large Vision-Language Models
- Repo - LLaVA - https://github.com/haotian-liu/LLaVA
Domains
- Evaluation
- Metrics
Related Pages
Metadata
| Property | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| page_type | Principle |
| workflow | Benchmark_Evaluation |