Principle: OpenGVLab InternVL VQA Accuracy Scoring
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Vision_Language, Metrics |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A family of evaluation metrics for visual question answering that measure model prediction accuracy using soft scoring, exact matching, and edit-distance-based methods.
Description
VQA evaluation uses multiple scoring approaches depending on the benchmark:
- VQA soft accuracy: The standard VQA metric where each question has 10 human-provided ground truth answers. Accuracy for a prediction is min(1, count_of_matching_answers / 3), reflecting inter-annotator agreement.
- Exact match: Binary scoring — the prediction either matches any ground truth answer exactly, or it does not.
- Relaxed accuracy: Allows 5% relative numerical tolerance for math/chart questions.
- ANLS (Average Normalized Levenshtein Similarity): Edit-distance-based scoring for OCR-heavy benchmarks (InfographicsVQA, DocVQA), with a threshold of 0.5.
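The first three scoring rules above can be sketched as follows. This is a minimal illustration, not the InternVL codebase's implementation; function names and the string-fallback behavior of the relaxed metric are assumptions.

```python
def vqa_soft_accuracy(prediction: str, ground_truths: list[str]) -> float:
    """VQA soft accuracy: min(1, matches / 3) over the 10 human answers."""
    matches = sum(1 for gt in ground_truths if gt == prediction)
    return min(1.0, matches / 3.0)

def exact_match(prediction: str, ground_truths: list[str]) -> float:
    """Binary: 1 if the prediction equals any ground truth, else 0."""
    return 1.0 if prediction in ground_truths else 0.0

def relaxed_accuracy(prediction: str, target: str, tolerance: float = 0.05) -> float:
    """Relaxed accuracy: 5% relative numerical tolerance for numeric answers;
    falls back to exact string match for non-numeric answers (an assumption)."""
    try:
        pred, tgt = float(prediction), float(target)
        if tgt == 0:
            return 1.0 if pred == 0 else 0.0
        return 1.0 if abs(pred - tgt) / abs(tgt) <= tolerance else 0.0
    except ValueError:
        return 1.0 if prediction == target else 0.0
```

For example, a prediction matching 2 of the 10 annotators scores 2/3 under soft accuracy, while 3 or more matches saturate at 1.0.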
Answer normalization is critical: before comparison, both predictions and ground truths are lowercased, stripped of punctuation, have articles (a/an/the) removed, and have number words converted to digits.
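A hedged sketch of that normalization pipeline is below; the exact regexes and word lists vary between benchmark implementations, so treat the specific rules here as assumptions.

```python
import re

_ARTICLES = {"a", "an", "the"}
# Illustrative number-word map; real evaluators typically cover more words.
_NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9", "ten": "10",
}

def normalize_answer(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)                      # strip punctuation
    words = [w for w in text.split() if w not in _ARTICLES]  # drop articles
    words = [_NUMBER_WORDS.get(w, w) for w in words]         # number words -> digits
    return " ".join(words)
```

For instance, "The two cats!" normalizes to "2 cats", so it matches a ground truth of "2 cats" that would otherwise fail an exact comparison.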
Usage
Use VQA soft accuracy for standard VQA benchmarks (TextVQA, VQAv2, OKVQA). Use ANLS for document understanding benchmarks (InfographicsVQA, DocVQA). Use relaxed accuracy for chart/math benchmarks (ChartQA).
Theoretical Basis
VQA soft accuracy (per question):

$$\mathrm{Acc}(a) = \min\left(1,\; \frac{\left|\{\,g \in G : g = a\,\}\right|}{3}\right)$$

Where $G$ is the set of 10 ground truth answers and $a$ is the model prediction.
ANLS (per question):

$$\mathrm{ANLS}(a, G) = \max_{g \in G} s(a, g), \qquad s(a, g) = \begin{cases} 1 - \mathrm{NL}(a, g) & \text{if } \mathrm{NL}(a, g) < \tau \\ 0 & \text{otherwise} \end{cases}$$

Where $\mathrm{NL}(a, g)$ is the Levenshtein distance between $a$ and $g$ normalized by the length of the longer string, and $\tau = 0.5$ is the threshold.
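Assuming the per-question ANLS form above (maximum over ground truths, threshold at $\tau = 0.5$), a minimal sketch with a plain dynamic-programming Levenshtein distance:

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance via the classic row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]

def anls_score(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best thresholded similarity over all ground truths."""
    best = 0.0
    for gt in ground_truths:
        if not prediction and not gt:
            nl = 0.0
        else:
            nl = levenshtein(prediction, gt) / max(len(prediction), len(gt))
        if nl < tau:  # similarities below the threshold score 0
            best = max(best, 1.0 - nl)
    return best
```

The threshold zeroes out near-random matches: a prediction one edit away from a 4-character answer still scores 0.75, while one that differs in every character scores 0.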