Principle:Mlfoundations Open flamingo Captioning Evaluation
Overview
Evaluation methodology that measures image captioning quality using CIDEr score by comparing generated captions against reference annotations in a few-shot in-context learning setting.
Description
Captioning evaluation provides few-shot demonstration examples (image-caption pairs) as context, then asks the model to generate a caption for a query image. The generated captions are scored against reference annotations using CIDEr (Consensus-based Image Description Evaluation), which measures consensus between the generated caption and multiple reference captions using TF-IDF weighted n-gram matching. The evaluation supports the COCO and Flickr30K benchmarks. In distributed settings, predictions are gathered from all ranks before the metric is computed.
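As a rough illustration of the few-shot setup, the sketch below assembles an interleaved prompt from demonstration image-caption pairs. The `<image>` and `<|endofchunk|>` tokens follow OpenFlamingo's published prompt format, but the helper function itself is hypothetical, not the repository's code:

```python
def build_caption_prompt(demo_captions, query_prefix="Output:"):
    """Interleave <image> placeholders with demonstration captions,
    then leave a trailing <image> slot for the query image.

    Each demonstration is rendered as '<image>Output:<caption><|endofchunk|>';
    the query image gets the same prefix but no caption, so the model's
    continuation is the caption to be scored.
    """
    parts = [
        f"<image>{query_prefix}{cap.strip()}<|endofchunk|>"
        for cap in demo_captions
    ]
    parts.append(f"<image>{query_prefix}")  # query slot: model completes this
    return "".join(parts)

# Two in-context demonstrations followed by the query slot.
prompt = build_caption_prompt(["A dog runs on grass.", "Two boats at a dock."])
```

At generation time, each `<image>` placeholder is paired with the corresponding image's visual features; the model's completion after the final prefix is decoded and scored against the reference captions.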
Usage
Use when evaluating a vision-language model's ability to describe images; CIDEr on COCO/Flickr30K is the standard benchmark for captioning quality.
Theoretical Basis
CIDEr uses TF-IDF weighted n-grams to measure how well generated captions match human reference captions. Higher TF-IDF weight is given to n-grams that appear frequently in an image's reference captions (high TF) but rarely across the rest of the dataset (high IDF), rewarding descriptive, image-specific captions over generic ones. The few-shot setup provides the model with example image-caption pairs as in-context demonstrations, testing its ability to learn the captioning format and apply it to new images.
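The core of the metric can be sketched for a single n-gram size as a TF-IDF cosine similarity between the candidate and each reference, averaged over references. This is a simplified illustration, not the official implementation: real CIDEr averages over n = 1..4 and includes stemming and a length penalty that are omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    """TF-IDF weights for the n-grams of one caption.

    TF is the n-gram's relative frequency in the caption; IDF penalizes
    n-grams that occur in many images' reference sets.
    """
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return {
        g: (c / total) * math.log(num_docs / max(doc_freq.get(g, 1), 1))
        for g, c in counts.items()
    }

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, references, all_reference_sets, n=1):
    """CIDEr-style score for one n-gram size.

    Document frequency counts in how many images' reference sets each
    n-gram appears, so dataset-wide common n-grams get near-zero weight.
    """
    doc_freq = Counter()
    for refs in all_reference_sets:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        doc_freq.update(seen)
    num_docs = len(all_reference_sets)
    cand_vec = tfidf_vector(candidate.split(), n, doc_freq, num_docs)
    scores = [
        cosine(cand_vec, tfidf_vector(ref.split(), n, doc_freq, num_docs))
        for ref in references
    ]
    return sum(scores) / len(scores)
```

Note how the IDF term does the work: a unigram like "a" that appears in every image's references gets weight log(N/N) = 0, so only the image-specific words contribute to the score.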
Related Pages
Implementation:Mlfoundations_Open_flamingo_Evaluate_captioning