Principle: mlfoundations OpenFlamingo Visual Question Answering Evaluation
Overview
Evaluation methodology that measures a model's ability to answer questions about images using the official VQA accuracy metric with soft consensus scoring across multiple human annotators.
Description
VQA evaluation presents the model with few-shot demonstrations of (image, question, answer) triples, then asks it to answer a new question about a new image. The model generates a free-form text answer, which is scored with the official VQA accuracy metric. The metric uses soft consensus: an answer receives a score of min(n / 3, 1), where n is the number of human annotators (typically 10) who gave that exact answer, so an answer earns full credit when at least 3 of the 10 annotators agreed. The evaluation supports the VQAv2, OK-VQA, VizWiz, and TextVQA benchmarks; OK-VQA additionally applies lemmatization during answer matching.
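The soft-consensus score described above can be sketched in a few lines. This is a simplified illustration: the official VQA evaluation code also normalizes answers and averages the score over annotator subsets, which is omitted here.

```python
def vqa_accuracy(predicted: str, annotations: list[str]) -> float:
    """Soft-consensus VQA accuracy: full credit when at least 3 of the
    (typically 10) human annotators gave exactly the predicted answer."""
    matches = sum(1 for a in annotations if a == predicted)
    return min(matches / 3.0, 1.0)

# 10 annotators: 4 said "red", 6 said "maroon".
annotations = ["red"] * 4 + ["maroon"] * 6
print(vqa_accuracy("red", annotations))     # 1.0  (4 matches, 4/3 capped at 1)
print(vqa_accuracy("blue", annotations))    # 0.0  (no annotator agreed)
```

Note the cap at 1: once 3 annotators agree, extra agreement does not raise the score further.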
Usage
When evaluating a vision-language model's visual understanding and reasoning capabilities across multiple VQA benchmarks.
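A few-shot VQA prompt interleaves demonstration triples before the query. The sketch below shows one plausible prompt layout; the exact template wording ("Question:" / "Short answer:") and the `<image>` placeholder are assumptions for illustration, not necessarily the official format.

```python
def build_vqa_prompt(demos: list[tuple[str, str]], question: str) -> str:
    """Assemble a few-shot VQA prompt. `demos` holds (question, answer)
    pairs; each `<image>` placeholder marks where the corresponding image
    is fed to the vision-language model. Hypothetical template."""
    parts = [f"<image>Question: {q} Short answer: {a}" for q, a in demos]
    # The final entry has no answer: the model must complete it.
    parts.append(f"<image>Question: {question} Short answer:")
    return "\n".join(parts)

demos = [("What color is the car?", "red"),
         ("How many dogs are there?", "2")]
print(build_vqa_prompt(demos, "What is on the table?"))
```

The trailing "Short answer:" cues the model to produce the same short, free-form answer style shown in the demonstrations.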
Theoretical Basis
The VQA accuracy metric accounts for inter-annotator agreement. For each predicted answer a_i, accuracy is computed as:
accuracy = min(count(a_i in annotations) / 3, 1)
This means full credit requires matching at least 3 annotators. Text normalization (lowercasing, article removal, punctuation handling, number-word-to-digit conversion) ensures fair comparison between the model's free-form answer and the annotations. The few-shot format provides example QA pairs to teach the model the expected answer format.
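The normalization steps named above can be sketched as follows. This is a minimal illustration, not the official normalizer, which handles more cases (contractions, a fuller number-word table, comma handling inside numbers).

```python
import re

ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9", "ten": "10"}

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and map number
    words to digits so e.g. "The Two dogs!" and "2 dogs" compare equal."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)          # punctuation handling
    words = [NUMBER_WORDS.get(w, w)              # number words -> digits
             for w in text.split() if w not in ARTICLES]
    return " ".join(words)

print(normalize_answer("The Two dogs!"))   # "2 dogs"
print(normalize_answer("2 dogs"))          # "2 dogs"
```

Both the model's prediction and each annotator answer are normalized before the consensus count, so superficial differences in casing or phrasing do not cost credit.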