Workflow:Mlfoundations Open flamingo Few Shot Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language_Models, Model_Evaluation, Benchmarking |
| Last Updated | 2026-02-08 03:30 GMT |
Overview
End-to-end process for evaluating an OpenFlamingo model across 8 vision-language benchmarks using few-shot in-context learning with random or retrieval-based demonstration selection.
Description
This workflow covers the complete evaluation pipeline for OpenFlamingo models on standardized benchmarks. The evaluation supports three task types: image captioning (COCO, Flickr-30K), visual question answering (VQAv2, OK-VQA, TextVQA, VizWiz), and classification (ImageNet, Hateful Memes). For each benchmark, the model receives few-shot demonstration examples (0, 4, 8, 16, or 32 shots) either selected randomly or via RICES (Retrieval-based In-Context Example Selection using CLIP similarity). The pipeline handles distributed evaluation across multiple GPUs, metric computation with official evaluation tools, and result aggregation across multiple trials with different random seeds.
Usage
Use this workflow to benchmark an OpenFlamingo model checkpoint against standard vision-language evaluation suites. You need a trained OpenFlamingo checkpoint and access to the benchmark datasets (images, annotations, and questions). The evaluation runs on one or more GPUs and can be launched via SLURM or torchrun.
Execution Steps
Step 1: Prepare Benchmark Datasets
Download and organize the evaluation datasets in the expected directory structure. Each benchmark requires specific image directories and annotation files. Captioning tasks (COCO, Flickr-30K) need Karpathy split JSON files and COCO-format annotation files. VQA tasks need train and test splits of questions and annotations in VQA format. Classification tasks need the ImageNet validation set or the Hateful Memes dataset with JSON annotations.
Key considerations:
- COCO requires train2014 images, val2014 images, Karpathy JSON, and captions annotations
- Scoring on VQAv2 test-dev and test-std requires submitting predictions to EvalAI; the script itself evaluates on the val split
- OK-VQA evaluation requires NLTK WordNet to be downloaded
- TextVQA and VizWiz annotations are bundled with the repository in eval/data/
- Each dataset has train and test splits; train is used for selecting demonstration examples
Step 2: Cache RICES Features (Optional)
Pre-compute CLIP visual features for all training set images across the desired benchmarks. This step uses the cache_rices_features.py script to encode images through a CLIP vision encoder and save the feature vectors as pickle files. These cached features enable fast retrieval-based selection of in-context examples during evaluation without recomputing features each time.
Key considerations:
- RICES uses CLIP similarity to find the most relevant demonstration examples for each query
- Caching is optional but recommended for repeated evaluations with different shot counts
- Features are saved per-dataset as .pkl files in a specified output directory
- The CLIP encoder for RICES can differ from the model's vision encoder
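A minimal sketch of the caching and retrieval idea, written in plain Python for readability. The function names, the pickle layout (one list of feature vectors per dataset), and the use of Python lists instead of tensors are assumptions for illustration; they are not the interface of cache_rices_features.py.

```python
import math
import pickle
from pathlib import Path


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def save_features(features, path):
    """Cache one dataset's feature vectors as a .pkl file (hypothetical layout)."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(features, f)


def load_features(path):
    """Load feature vectors previously written by save_features."""
    with open(path, "rb") as f:
        return pickle.load(f)


def retrieve_top_k(query, train_features, k):
    """Return indices of the k training items most similar to the query."""
    q = l2_normalize(query)
    scored = []
    for idx, feat in enumerate(train_features):
        f = l2_normalize(feat)
        scored.append((sum(a * b for a, b in zip(q, f)), idx))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]
```

In a real run the vectors would come from a CLIP image encoder, and the similarity would typically be computed as one batched matrix product rather than a Python loop.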
Step 3: Load Model For Evaluation
Initialize the OpenFlamingo model using the evaluation wrapper class. The wrapper loads the model through the standard factory function, then loads the trained checkpoint weights. The model is set to evaluation mode with left-padding for generation. Configure precision settings (amp_bf16 recommended) and distribute across available GPUs.
Key considerations:
- The eval wrapper (EvalModel) handles model creation, checkpoint loading, and device placement
- Checkpoint can contain either full state dict or filtered trainable-only state dict
- The model dynamically resolves which evaluation module to use based on the --model flag
- BLIP-2 baselines are supported in addition to OpenFlamingo models, but only for 0-shot evaluation
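The dynamic module lookup and the left-padding convention can be sketched as follows. The package path open_flamingo.eval.models, the helper names, and the pad id are assumptions for illustration; only the EvalModel class name comes from the text above.

```python
import importlib


def resolve_eval_class(model_name,
                       package="open_flamingo.eval.models",  # assumed package path
                       class_name="EvalModel"):
    """Import <package>.<model_name> and return the wrapper class it defines."""
    module = importlib.import_module(f"{package}.{model_name}")
    return getattr(module, class_name)


def left_pad(token_ids_batch, pad_id):
    """Pad shorter prompts on the left so every prompt ends at the same position.

    With left-padding, newly generated tokens directly follow real prompt
    tokens instead of following padding.
    """
    width = max(len(ids) for ids in token_ids_batch)
    padded, mask = [], []
    for ids in token_ids_batch:
        n_pad = width - len(ids)
        padded.append([pad_id] * n_pad + list(ids))
        mask.append([0] * n_pad + [1] * len(ids))  # 0 marks padding positions
    return padded, mask
```

For example, a value such as `--model open_flamingo` would map to a module of the same name under the assumed package, so adding a new baseline means adding one module that exposes the same class name.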
Step 4: Run Captioning Evaluation
Evaluate on image captioning benchmarks (COCO and/or Flickr-30K). For each shot count and trial seed: select demonstration examples (random or RICES), format few-shot prompts with the "<image>Output:{caption}<|endofchunk|>" template, generate captions using beam search, post-process outputs, and compute CIDEr scores using the official COCO evaluation toolkit. Results are aggregated across trials to compute mean and standard deviation.
Key considerations:
- The caption prompt format is "<image>Output:{caption}<|endofchunk|>"
- Beam search with 3 beams and max 20 tokens is the default for captioning
- Post-processing removes trailing text after sentence boundaries
- CIDEr is the primary metric, computed using pycocoevalcap
- For 0-shot, the <image> tags are removed from the context, but the text template is preserved
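A minimal sketch of how the captioning context could be assembled from the template above. The helper names are hypothetical, and the 0-shot handling shown here (text-only demonstrations with their <image> tags stripped, while the query keeps its tag) is one reading of the last bullet rather than a confirmed description of the evaluation script.

```python
IMAGE_TAG = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


def caption_prompt(caption=None):
    """Format one example; with caption=None the prompt is left open for generation."""
    if caption is None:
        return f"{IMAGE_TAG}Output:"
    return f"{IMAGE_TAG}Output:{caption.strip()}{END_OF_CHUNK}"


def build_caption_context(demo_captions, zero_shot=False):
    """Concatenate demonstrations, then append the open-ended query prompt."""
    context = "".join(caption_prompt(c) for c in demo_captions)
    if zero_shot:
        # Assumption: keep the text template but drop the demonstration images.
        context = context.replace(IMAGE_TAG, "")
    return context + caption_prompt()
```

The number of <image> tags left in the final string should equal the number of images passed to the model for that example.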
Step 5: Run VQA Evaluation
Evaluate on visual question answering benchmarks (VQAv2, OK-VQA, TextVQA, and/or VizWiz). For each benchmark, shot count, and trial seed: select demonstrations, format prompts with the "<image>Question:{question} Short answer:{answer}<|endofchunk|>" template, generate short answers, post-process outputs, and compute VQA accuracy using the official evaluation metrics. OK-VQA uses a specialized normalization pipeline.
Key considerations:
- The VQA prompt format is "<image>Question:{question} Short answer:{answer}<|endofchunk|>"
- Max generation length is 5 tokens for VQA tasks
- VQA accuracy uses the official metric with extensive text normalization (number words, contractions, articles, punctuation)
- OK-VQA uses an additional post-processing pipeline with lemmatization
- VQAv2 test-dev results can be formatted for EvalAI submission using fill_vqa_testdev_results.py
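The prompt template and the core of the VQA accuracy rule can be sketched as below. This is a simplified illustration: the helper names are hypothetical, and the normalization here only lowercases and strips whitespace, whereas the official metric also handles number words, contractions, articles, and punctuation.

```python
def vqa_prompt(question, answer=None):
    """Format one VQA example; with answer=None the prompt is left open for generation."""
    prefix = f"<image>Question:{question} Short answer:"
    return prefix if answer is None else f"{prefix}{answer}<|endofchunk|>"


def normalize(text):
    """Very rough stand-in for the official answer normalization (assumption)."""
    return text.lower().strip()


def vqa_accuracy(prediction, human_answers):
    """Average min(1, matches / 3) over every leave-one-annotator-out subset."""
    pred = normalize(prediction)
    answers = [normalize(a) for a in human_answers]
    per_subset = []
    for i in range(len(answers)):
        others = answers[:i] + answers[i + 1:]
        matches = sum(1 for a in others if a == pred)
        per_subset.append(min(1.0, matches / 3))
    return sum(per_subset) / len(per_subset)
```

Under this rule, an answer given by at least four of ten annotators always scores 1.0, and one given by three annotators scores 0.9.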
Step 6: Run Classification Evaluation
Evaluate on classification benchmarks (ImageNet and/or Hateful Memes). Instead of generating text, the model computes log-likelihood scores for each candidate class name conditioned on the input image and few-shot context. The class with the highest log-probability is selected as the prediction. ImageNet uses 1000 class names with synonyms, while Hateful Memes is a binary classification task.
Key considerations:
- Classification uses get_rank_classifications() which computes log-probabilities over candidate labels
- KV-cache is used to avoid recomputing context for each class name (significant speedup)
- Prompt ensembling averages log-likelihoods over permutations of in-context examples
- ImageNet predictions are ranked by length-normalized log-probabilities; Hateful Memes is scored with ROC AUC
- Multi-token class names are handled by looping through tokens sequentially
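To illustrate why length normalization matters for multi-token class names, here is a small sketch that ranks classes from per-token log-probabilities. The function name and input layout are hypothetical; this is not the implementation of get_rank_classifications(), and it omits the KV-cache and prompt ensembling.

```python
def rank_classes(token_logprobs_by_class, normalize_length=True):
    """Rank class names by (optionally length-normalized) total log-probability.

    token_logprobs_by_class maps each class name to the list of log-probabilities
    the model assigned to that name's tokens, in order.
    """
    scores = {}
    for name, logprobs in token_logprobs_by_class.items():
        total = sum(logprobs)
        scores[name] = total / len(logprobs) if normalize_length else total
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Illustrative numbers only: a one-token name versus a three-token name.
example = {"cat": [-0.5], "golden retriever": [-0.4, -0.4, -0.4]}
normalized_ranking, _ = rank_classes(example)
raw_ranking, _ = rank_classes(example, normalize_length=False)
```

Without normalization, the summed log-probability penalizes longer names simply for having more tokens, so the one-token class wins in this example; with normalization, the class with the higher average per-token score wins.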
Step 7: Aggregate And Save Results
Collect evaluation results across all benchmarks, shot counts, and trials. Compute mean and standard deviation for each configuration. Save the complete results to a JSON file. Results from distributed workers are gathered using all_gather_object before aggregation on rank 0.
Key considerations:
- Results are gathered from all distributed workers before metric computation
- The JSON output includes per-trial scores, mean, and standard deviation for each benchmark and shot count
- Only rank 0 writes the final results file
- Temporary prediction files are cleaned up after metric computation
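The aggregation step can be sketched with the standard library alone. The JSON key names, the nested input layout, and the use of population standard deviation are assumptions for illustration; in a distributed run, the per-worker results would first be gathered (for example with all_gather_object) before this code runs on rank 0.

```python
import json
import statistics


def summarize_trials(scores):
    """Per-trial scores plus their mean and (population) standard deviation."""
    scores = [float(s) for s in scores]
    return {
        "trials": scores,
        "mean": statistics.fmean(scores),
        "stddev": statistics.pstdev(scores),
    }


def aggregate(results):
    """results maps benchmark -> shot count -> list of per-trial scores."""
    return {
        benchmark: {str(shots): summarize_trials(scores)
                    for shots, scores in by_shots.items()}
        for benchmark, by_shots in results.items()
    }


def save_results(summary, path, rank):
    """Write the summary as JSON, but only on rank 0; returns whether a file was written."""
    if rank != 0:
        return False
    with open(path, "w") as f:
        json.dump(summary, f, indent=2)
    return True
```

Shot counts are converted to strings up front because JSON object keys are always strings, which keeps the in-memory summary identical to what is read back from disk.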