Implementation: mlfoundations/open_flamingo evaluate_vqa
Overview
A concrete tool in the OpenFlamingo evaluation module for running few-shot VQA evaluation with official VQA accuracy scoring across four benchmarks.
Description
The evaluate_vqa() function:
- Loads train and test VQADataset splits
- Selects few-shot examples (random or RICES)
- Constructs prompts in the format "<image>Question:{q} Short answer:{a}<|endofchunk|>"
- Generates answers via eval_model.get_outputs() with beam search (max 5 tokens)
- Post-processes answers (extracts text before "Question"/"Answer" tokens)
- Gathers predictions across ranks
- Computes VQA accuracy via compute_vqa_accuracy()

Handles dataset-specific paths and formats for VQAv2, OK-VQA, VizWiz, and TextVQA.
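The few-shot prompt construction described above can be sketched as follows. The helper name `build_vqa_prompt` is hypothetical (the real logic is inline in evaluate_vqa), but the prompt template matches the format quoted in this page.

```python
# Sketch of the few-shot prompt construction (hypothetical helper; the
# actual logic is inlined in evaluate_vqa in evaluate.py).
def build_vqa_prompt(shots, query_question):
    """Concatenate in-context examples, then the unanswered query.

    shots: list of (question, answer) pairs for the in-context examples.
    Each image is represented by the <image> token; the trailing query
    leaves "Short answer:" open for the model to complete.
    """
    context = "".join(
        f"<image>Question:{q} Short answer:{a}<|endofchunk|>" for q, a in shots
    )
    return context + f"<image>Question:{query_question} Short answer:"

prompt = build_vqa_prompt([("What color is the cat?", "black")], "How many dogs?")
```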
Usage
Called from the main evaluation loop for VQA benchmarks.
Code Reference
Source
- Repository: https://github.com/mlfoundations/open_flamingo
- File: open_flamingo/eval/evaluate.py, lines 899-1115 (evaluate_vqa)
- File: open_flamingo/eval/vqa_metric.py, lines 527-560 (compute_vqa_accuracy)
Signature
def evaluate_vqa(
args: argparse.Namespace,
eval_model: BaseEvalModel,
seed: int = 42,
min_generation_length: int = 0,
max_generation_length: int = 5,
num_beams: int = 3,
length_penalty: float = 0.0,
num_shots: int = 8,
dataset_name: str = "vqav2",
cached_features=None,
) -> float:
"""Returns VQA accuracy score"""
def compute_vqa_accuracy(
result_json_path: str,
question_json_path: str,
annotation_json_path: str,
) -> float:
"""Returns overall VQA accuracy"""
Import
from open_flamingo.eval.evaluate import evaluate_vqa
from open_flamingo.eval.vqa_metric import compute_vqa_accuracy
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | argparse.Namespace | Yes | Eval config with dataset paths |
| eval_model | BaseEvalModel | Yes | Model wrapper |
| seed | int | No | Random seed (default 42) |
| num_shots | int | No | Number of few-shot examples (default 8) |
| dataset_name | str | No | One of "vqav2", "ok_vqa", "vizwiz", "textvqa" |
| max_generation_length | int | No | Maximum tokens to generate (default 5) |
| cached_features | Tensor | No | RICES features for retrieval-based example selection |
Outputs
| Type | Description |
|---|---|
| float | VQA accuracy score |
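The returned score follows the official VQA accuracy rule: a predicted answer is fully correct when at least 3 of the 10 human annotators gave it. A simplified sketch of that rule (the official metric additionally averages over leave-one-out subsets of the annotations, which is omitted here):

```python
# Simplified sketch of the official VQA accuracy rule used by
# compute_vqa_accuracy: an answer scores min(matches / 3, 1), so it is
# fully correct once at least 3 human annotators agree with it.
# (The official metric also averages over leave-one-out subsets of the
# 10 annotations; that refinement is omitted in this sketch.)
def vqa_accuracy(predicted, human_answers):
    matches = sum(1 for a in human_answers if a == predicted)
    return min(matches / 3.0, 1.0)
```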
Usage Examples
# Run 8-shot VQA evaluation on VQAv2
accuracy = evaluate_vqa(
args=args,
eval_model=eval_model,
seed=42,
num_shots=8,
dataset_name="vqav2",
)
print(f"VQAv2 8-shot accuracy: {accuracy:.4f}")
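The post-processing step mentioned in the description can be sketched as follows. The helper name `postprocess_vqa_answer` is hypothetical; it illustrates truncating the generation before any "Question"/"Answer" token, since a short-answer model may start emitting a new in-context example.

```python
# Sketch of the answer post-processing step (hypothetical helper):
# truncate the generated text at the first "Question" or "Answer" token,
# then strip surrounding whitespace.
def postprocess_vqa_answer(generated):
    for stop in ("Question", "Answer"):
        idx = generated.find(stop)
        if idx != -1:
            generated = generated[:idx]
    return generated.strip()
```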
Related Pages
Principle:Mlfoundations_Open_flamingo_Visual_Question_Answering_Evaluation