Principle: mlfoundations OpenFlamingo Visual Question Answering Evaluation
Overview
Evaluation methodology that measures a model's ability to answer questions about images using the official VQA accuracy metric with soft consensus scoring across multiple human annotators.
Description
VQA evaluation presents the model with few-shot demonstrations of (image, question, answer) triples, then asks it to answer a new question about a new image. The model generates a free-form text answer, which is scored with the official VQA accuracy metric. The metric uses soft consensus: an answer receives a score of min(n / 3, 1), where n is the number of human annotators (typically 10) who gave that exact answer, so an answer earns full credit when at least 3 of the 10 annotators agreed. The evaluation supports the VQAv2, OK-VQA, VizWiz, and TextVQA benchmarks; OK-VQA additionally applies lemmatization during answer matching.
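The soft-consensus score described above can be sketched in a few lines. This is a simplified illustration: the official VQA evaluation code also normalizes answers and averages the score over annotator subsets, which is omitted here.

```python
def vqa_accuracy(predicted: str, annotations: list[str]) -> float:
    """Soft-consensus VQA accuracy: full credit when at least 3 of the
    (typically 10) human annotators gave exactly the predicted answer."""
    matches = sum(1 for a in annotations if a == predicted)
    return min(matches / 3.0, 1.0)

# 10 annotators: 4 said "red", 6 said "maroon".
annotations = ["red"] * 4 + ["maroon"] * 6
print(vqa_accuracy("red", annotations))     # 1.0  (4 matches, 4/3 capped at 1)
print(vqa_accuracy("blue", annotations))    # 0.0  (no annotator agreed)
```

Note the cap at 1: once 3 annotators agree, extra agreement does not raise the score further.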
Usage
When evaluating a vision-language model's visual understanding and reasoning capabilities across multiple VQA benchmarks.
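A few-shot VQA prompt interleaves demonstration triples before the query. The sketch below shows one plausible prompt layout; the exact template wording ("Question:" / "Short answer:") and the `<image>` placeholder are assumptions for illustration, not necessarily the official format.

```python
def build_vqa_prompt(demos: list[tuple[str, str]], question: str) -> str:
    """Assemble a few-shot VQA prompt. `demos` holds (question, answer)
    pairs; each `<image>` placeholder marks where the corresponding image
    is fed to the vision-language model. Hypothetical template."""
    parts = [f"<image>Question: {q} Short answer: {a}" for q, a in demos]
    # The final entry has no answer: the model must complete it.
    parts.append(f"<image>Question: {question} Short answer:")
    return "\n".join(parts)

demos = [("What color is the car?", "red"),
         ("How many dogs are there?", "2")]
print(build_vqa_prompt(demos, "What is on the table?"))
```

The trailing "Short answer:" cues the model to produce the same short, free-form answer style shown in the demonstrations.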
Theoretical Basis
The VQA accuracy metric accounts for inter-annotator agreement. For each predicted answer a_i, accuracy is computed as:
accuracy = min(count(a_i in annotations) / 3, 1)
This means full credit requires matching at least 3 annotators. Text normalization (lowercasing, article removal, punctuation handling, number-word-to-digit conversion) ensures fair comparison between the model's free-form answer and the annotations. The few-shot format provides example QA pairs to teach the model the expected answer format.
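The normalization steps named above can be sketched as follows. This is a minimal illustration, not the official normalizer, which handles more cases (contractions, a fuller number-word table, comma handling inside numbers).

```python
import re

ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3",
                "four": "4", "five": "5", "six": "6", "seven": "7",
                "eight": "8", "nine": "9", "ten": "10"}

def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation, drop articles, and map number
    words to digits so e.g. "The Two dogs!" and "2 dogs" compare equal."""
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)          # punctuation handling
    words = [NUMBER_WORDS.get(w, w)              # number words -> digits
             for w in text.split() if w not in ARTICLES]
    return " ".join(words)

print(normalize_answer("The Two dogs!"))   # "2 dogs"
print(normalize_answer("2 dogs"))          # "2 dogs"
```

Both the model's prediction and each annotator answer are normalized before the consensus count, so superficial differences in casing or phrasing do not cost credit.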