Workflow:Mlfoundations Open flamingo Few Shot Evaluation

Knowledge Sources	OpenFlamingo, OpenFlamingo Paper, Eval README
Domains	Vision_Language_Models, Model_Evaluation, Benchmarking
Last Updated	2026-02-08 03:30 GMT

Overview

End-to-end process for evaluating an OpenFlamingo model across 8 vision-language benchmarks using few-shot in-context learning with random or retrieval-based demonstration selection.

Description

This workflow covers the complete evaluation pipeline for OpenFlamingo models on standardized benchmarks. The evaluation supports three task types: image captioning (COCO, Flickr-30K), visual question answering (VQAv2, OK-VQA, TextVQA, VizWiz), and classification (ImageNet, Hateful Memes). For each benchmark, the model receives few-shot demonstration examples (0, 4, 8, 16, or 32 shots) that are selected either randomly or via RICES (Retrieval-based In-Context Example Selection), which ranks candidate demonstrations by CLIP similarity to the query. The pipeline handles distributed evaluation across multiple GPUs, metric computation with the official evaluation tools, and aggregation of results across multiple trials that use different random seeds.
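The random-selection branch can be illustrated with a minimal sketch. It assumes one demonstration set per (shot count, seed) pair; the actual pipeline may sample per query. The names `sample_demonstrations`, `TRIAL_SEEDS`, and the stand-in `train_set` are hypothetical and do not come from the repository.

```python
import random

def sample_demonstrations(train_set, num_shots, seed):
    """Pick `num_shots` demonstrations at random, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG so one trial does not disturb another
    return rng.sample(train_set, num_shots)

SHOTS = [0, 4, 8, 16, 32]     # shot counts named in the text
TRIAL_SEEDS = [42, 43, 44]    # hypothetical seeds, one per trial

# Stand-in for a real train split (hypothetical placeholder items).
train_set = [f"train_example_{i}" for i in range(100)]

# One demonstration set per (shot count, seed) pair.
grid = {
    (shots, seed): sample_demonstrations(train_set, shots, seed)
    for shots in SHOTS
    for seed in TRIAL_SEEDS
}
```

Seeding a local `random.Random` per trial keeps each trial reproducible on its own, which is what makes a mean and standard deviation across trials meaningful.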

Usage

Execute this workflow when you want to benchmark an OpenFlamingo model checkpoint against standard vision-language evaluation suites. You need a trained OpenFlamingo checkpoint and access to the benchmark datasets (images, annotations, and questions). The evaluation can be run on one or more GPUs via SLURM or torchrun.

Execution Steps

Step 1: Prepare Benchmark Datasets

Download and organize the evaluation datasets in the expected directory structure. Each benchmark requires specific image directories and annotation files. Captioning tasks (COCO, Flickr-30K) need Karpathy split JSON files and COCO-format annotation files. VQA tasks need train and test splits of questions and annotations in VQA format. Classification tasks need the ImageNet validation set or the Hateful Memes dataset with its JSON annotations.

Key considerations:

COCO requires train2014 images, val2014 images, Karpathy JSON, and captions annotations
VQAv2 test-dev and test-std require submission to EvalAI (the script itself evaluates on the val split)
OK-VQA evaluation requires NLTK WordNet to be downloaded
TextVQA and VizWiz annotations are bundled with the repository in eval/data/
Each dataset has train and test splits; train is used for selecting demonstration examples
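Before launching a long run, it can help to verify that every required path exists. The sketch below is a minimal pre-flight check for the COCO requirements listed above. The directory layout and the key names in `coco_required` are hypothetical placeholders, not the repository's expected paths.

```python
from pathlib import Path

def find_missing(required):
    """Return the names of required entries whose path does not exist."""
    return [name for name, path in required.items() if not Path(path).exists()]

# Hypothetical layout; point these at wherever the datasets actually live.
coco_required = {
    "train_images": "data/coco/train2014",
    "val_images": "data/coco/val2014",
    "karpathy_json": "data/coco/karpathy.json",
    "captions_annotations": "data/coco/captions.json",
}

missing = find_missing(coco_required)
if missing:
    print("Missing COCO inputs:", ", ".join(missing))
```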

Step 2: Cache RICES Features (Optional)

Pre-compute CLIP visual features for all training set images across the desired benchmarks. This step uses the cache_rices_features.py script to encode images through a CLIP vision encoder and save the feature vectors as pickle files. These cached features enable fast retrieval-based selection of in-context examples during evaluation without recomputing features each time.

Key considerations:

RICES uses CLIP similarity to find the most relevant demonstration examples for each query
Caching is optional but recommended for repeated evaluations with different shot counts
Features are saved per-dataset as .pkl files in a specified output directory
The CLIP encoder for RICES can differ from the model's vision encoder
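The retrieval idea can be sketched without any model code: given cached feature vectors, rank training items by cosine similarity to the query and keep the top k. This is a simplified illustration in pure Python, not the contents of cache_rices_features.py. The toy two-dimensional vectors and the name `retrieve_top_k` are assumptions made for the example.

```python
import math
import pickle

def normalize(vec):
    """Scale a vector to unit length (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def retrieve_top_k(query_feat, train_feats, k):
    """Return indices of the k training features most similar to the query."""
    q = normalize(query_feat)
    sims = []
    for idx, feat in enumerate(train_feats):
        f = normalize(feat)
        sims.append((sum(a * b for a, b in zip(q, f)), idx))  # cosine similarity
    sims.sort(reverse=True)  # most similar first
    return [idx for _, idx in sims[:k]]

# Toy stand-ins for cached CLIP features (real features are much larger).
train_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]

# Caching round trip: serialize once, reload for later runs.
cached_blob = pickle.dumps(train_feats)
train_feats_loaded = pickle.loads(cached_blob)

query = [1.0, 0.05]
demo_indices = retrieve_top_k(query, train_feats_loaded, k=2)
```

Because the features depend only on the training images and the CLIP encoder, the same cache can be reused for every shot count and trial.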

Step 3: Load Model For Evaluation

Initialize the OpenFlamingo model using the evaluation wrapper class. The wrapper loads the model through the standard factory function, then loads the trained checkpoint weights. The model is set to evaluation mode with left-padding for generation. Configure precision settings (amp_bf16 recommended) and distribute across available GPUs.

Key considerations:

The eval wrapper (EvalModel) handles model creation, checkpoint loading, and device placement
The checkpoint can contain either a full state dict or a filtered state dict holding only the trainable parameters
The model dynamically resolves which evaluation module to use based on the --model flag
BLIP-2 baselines are supported in addition to OpenFlamingo models, but only in the 0-shot setting
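Left-padding matters for batched generation because a decoder-only model appends new tokens on the right, so every prompt in a batch should end at the same position. The sketch below shows the idea on plain token-id lists. The helper `left_pad` is hypothetical; in practice the tokenizer performs this step.

```python
def left_pad(sequences, pad_id):
    """Pad token-id lists on the left to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        pad = max_len - len(seq)
        input_ids.append([pad_id] * pad + seq)             # padding goes first
        attention_mask.append([0] * pad + [1] * len(seq))  # 0 marks padding
    return input_ids, attention_mask

ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
```

With right-padding, the shorter prompt would be followed by pad tokens and generation would continue from padding rather than from the end of the prompt.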

Step 4: Run Captioning Evaluation

Evaluate on image captioning benchmarks (COCO and/or Flickr-30K). For each shot count and trial seed: select demonstration examples (random or RICES), format few-shot prompts with the "<image>Output:{caption}<|endofchunk|>" template, generate captions using beam search, post-process the outputs, and compute CIDEr scores using the official COCO evaluation toolkit. Results are aggregated across trials to compute the mean and standard deviation.

Key considerations:

The caption prompt format is "<image>Output:{caption}<|endofchunk|>"
Beam search with 3 beams and max 20 tokens is the default for captioning
Post-processing removes trailing text after sentence boundaries
CIDEr is the primary metric, computed using pycocoevalcap
For 0-shot, image tags are removed from context but text template is preserved
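The prompt assembly and post-processing described above can be sketched as follows. The template string is taken from the text. The function names, the `zero_shot_text_only` flag, and the boundary rule (cut at the first newline or period) are assumptions made for illustration and may differ from the repository's behavior.

```python
def caption_prompt(caption=None):
    """Format one captioning example; leave the caption open for the query image."""
    if caption is None:
        return "<image>Output:"
    return f"<image>Output:{caption.strip()}<|endofchunk|>"

def build_context(demo_captions, zero_shot_text_only=False):
    """Concatenate demonstrations, then append the open-ended query prompt."""
    context = "".join(caption_prompt(c) for c in demo_captions)
    if zero_shot_text_only:
        # Assumed reading of the 0-shot note: keep the text template of the
        # context but drop its image tags. The query keeps its own <image>.
        context = context.replace("<image>", "")
    return context + caption_prompt()

def postprocess_caption(generated):
    """Keep only the text before the first newline or period (assumed rule)."""
    for boundary in ("\n", "."):
        generated = generated.split(boundary, 1)[0]
    return generated.strip()

prompt = build_context(["A dog on a beach.", "Two cats on a sofa."])
clean = postprocess_caption(" A man riding a horse. Output: A red bus")
```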

Step 5: Run VQA Evaluation

Evaluate on visual question answering benchmarks (VQAv2, OK-VQA, TextVQA, and/or VizWiz). For each benchmark, shot count, and trial seed: select demonstrations, format prompts with "<image>Question:{question} Short answer:{answer}<|endofchunk|>" template, generate short answers, post-process outputs, and compute VQA accuracy using official evaluation metrics. OK-VQA uses a specialized normalization pipeline.

Key considerations:

The VQA prompt format is "<image>Question:{question} Short answer:{answer}<|endofchunk|>"
Max generation length is 5 tokens for VQA tasks
VQA accuracy uses the official metric with extensive text normalization (number words, contractions, articles, punctuation)
OK-VQA uses an additional post-processing pipeline with lemmatization
VQAv2 test-dev results can be formatted for EvalAI submission using fill_vqa_testdev_results.py
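A simplified sketch of the VQA prompt and scoring is given below. The template comes from the text. The accuracy function uses the commonly cited form min(1, matches / 3) with a deliberately small normalizer; the official metric applies a fuller normalization (for example contractions), so this is an approximation. The names `vqa_prompt`, `normalize_answer`, and `vqa_accuracy` and the truncated `NUMBER_WORDS` map are hypothetical.

```python
import re

ARTICLES = {"a", "an", "the"}
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3"}  # truncated map

def vqa_prompt(question, answer=None):
    """Format one VQA example; omit the answer for the query to be completed."""
    suffix = f"{answer.strip()}<|endofchunk|>" if answer is not None else ""
    return f"<image>Question:{question.strip()} Short answer:{suffix}"

def normalize_answer(text):
    """Lowercase, strip punctuation, drop articles, map number words to digits."""
    text = re.sub(r"[^\w\s]", "", text.lower().strip())
    words = [NUMBER_WORDS.get(w, w) for w in text.split() if w not in ARTICLES]
    return " ".join(words)

def vqa_accuracy(prediction, human_answers):
    """Simplified VQA accuracy: full credit once three annotators agree."""
    pred = normalize_answer(prediction)
    matches = sum(normalize_answer(a) == pred for a in human_answers)
    return min(1.0, matches / 3)

score = vqa_accuracy("Red.", ["red", "red", "dark red"] + ["blue"] * 7)
```

The 5-token generation limit pairs naturally with this metric, since reference answers are short and any extra generated words would prevent an exact match after normalization.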

Step 6: Run Classification Evaluation

Evaluate on classification benchmarks (ImageNet and/or Hateful Memes). Instead of generating text, the model computes log-likelihood scores for each candidate class name conditioned on the input image and few-shot context. The class with the highest log-probability is selected as the prediction. ImageNet uses 1000 class names with synonyms, while Hateful Memes is a binary classification task.

Key considerations:

Classification uses get_rank_classifications() which computes log-probabilities over candidate labels
KV-cache is used to avoid recomputing context for each class name (significant speedup)
Prompt ensembling averages log-likelihoods over permutations of in-context examples
ImageNet predictions are ranked by length-normalized log-probabilities; Hateful Memes is scored with ROC AUC
Multi-token class names are handled by looping through tokens sequentially
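The effect of length normalization and prompt ensembling can be shown on toy numbers. The sketch below is not get_rank_classifications(); it assumes the per-token log-probabilities of each class name have already been obtained from the model. The helper names and the values in `candidates` are hypothetical.

```python
def score_class(token_logprobs, normalize_length=True):
    """Sum the per-token log-probs of a class name, optionally averaging by length."""
    total = sum(token_logprobs)
    return total / len(token_logprobs) if normalize_length else total

def rank_classes(class_token_logprobs, normalize_length=True):
    """Return class names ordered from most to least likely."""
    scores = {
        name: score_class(logprobs, normalize_length)
        for name, logprobs in class_token_logprobs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def ensemble_scores(per_prompt_scores):
    """Average each class's score over several orderings of the demonstrations."""
    names = per_prompt_scores[0].keys()
    count = len(per_prompt_scores)
    return {n: sum(s[n] for s in per_prompt_scores) / count for n in names}

# Hypothetical per-token log-probs: one token for "cat", three for the other.
candidates = {
    "cat": [-1.2],
    "golden retriever": [-0.9, -0.8, -0.7],
}

normalized_ranking = rank_classes(candidates, normalize_length=True)
raw_ranking = rank_classes(candidates, normalize_length=False)
```

Without normalization, the summed score penalizes multi-token names simply for being longer, which is why the two rankings above disagree.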

Step 7: Aggregate And Save Results

Collect evaluation results across all benchmarks, shot counts, and trials. Compute mean and standard deviation for each configuration. Save the complete results to a JSON file. Results from distributed workers are gathered using all_gather_object before aggregation on rank 0.

Key considerations:

Results are gathered from all distributed workers before metric computation
The JSON output includes per-trial scores, mean, and standard deviation for each benchmark and shot count
Only rank 0 writes the final results file
Temporary prediction files are cleaned up after metric computation
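The aggregation step can be sketched with the standard library. The scores below are placeholder values chosen for easy arithmetic, not real results, and the benchmark name, the field names, and the use of the population standard deviation are assumptions. In a distributed run, this would execute on rank 0 after the gather.

```python
import json
import statistics

def summarize(trial_scores):
    """Collect per-trial scores with their mean and standard deviation."""
    return {
        "trials": list(trial_scores),
        "mean": statistics.mean(trial_scores),
        # Population std is an assumption; sample std would use statistics.stdev.
        "stddev": statistics.pstdev(trial_scores),
    }

# Placeholder scores keyed by benchmark and shot count (not real results).
raw = {"benchmark_a": {4: [1.0, 2.0, 3.0], 8: [2.0, 2.0, 2.0]}}

results = {
    bench: [
        {"shots": shots, **summarize(scores)}
        for shots, scores in sorted(by_shots.items())
    ]
    for bench, by_shots in raw.items()
}

report = json.dumps(results, indent=2)  # text that would be written to the results file
```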
