Principle:Mlfoundations Open flamingo Captioning Evaluation
Overview
Evaluation methodology that measures image captioning quality using CIDEr score by comparing generated captions against reference annotations in a few-shot in-context learning setting.
Description
Captioning evaluation provides few-shot demonstration examples (image-caption pairs) as context, then asks the model to generate a caption for a query image. The generated captions are scored against reference annotations using CIDEr (Consensus-based Image Description Evaluation), which measures consensus between the generated caption and multiple reference captions using TF-IDF weighted n-gram matching. The evaluation supports the COCO and Flickr30K benchmarks. In distributed settings, predictions are gathered from all ranks before the metric is computed.
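As a rough illustration of the few-shot setup, the sketch below assembles an interleaved prompt from demonstration image-caption pairs. The `<image>` and `<|endofchunk|>` tokens follow OpenFlamingo's published prompt format, but the helper function itself is hypothetical, not the repository's code:

```python
def build_caption_prompt(demo_captions, query_prefix="Output:"):
    """Interleave <image> placeholders with demonstration captions,
    then leave a trailing <image> slot for the query image.

    Each demonstration is rendered as '<image>Output:<caption><|endofchunk|>';
    the query image gets the same prefix but no caption, so the model's
    continuation is the caption to be scored.
    """
    parts = [
        f"<image>{query_prefix}{cap.strip()}<|endofchunk|>"
        for cap in demo_captions
    ]
    parts.append(f"<image>{query_prefix}")  # query slot: model completes this
    return "".join(parts)

# Two in-context demonstrations followed by the query slot.
prompt = build_caption_prompt(["A dog runs on grass.", "Two boats at a dock."])
```

At generation time, each `<image>` placeholder is paired with the corresponding image's visual features; the model's completion after the final prefix is decoded and scored against the reference captions.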
Usage
Use when evaluating a vision-language model's ability to describe images; CIDEr on COCO/Flickr30K is the standard benchmark for captioning quality.
Theoretical Basis
CIDEr uses TF-IDF weighted n-grams to measure how well generated captions match human reference captions. Higher TF-IDF weight is given to n-grams that appear frequently in an image's reference captions (high TF) but rarely across the rest of the dataset (high IDF), rewarding descriptive, image-specific captions over generic ones. The few-shot setup provides the model with example image-caption pairs as in-context demonstrations, testing its ability to learn the captioning format and apply it to new images.
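The core of the metric can be sketched for a single n-gram size as a TF-IDF cosine similarity between the candidate and each reference, averaged over references. This is a simplified illustration, not the official implementation: real CIDEr averages over n = 1..4 and includes stemming and a length penalty that are omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vector(tokens, n, doc_freq, num_docs):
    """TF-IDF weights for the n-grams of one caption.

    TF is the n-gram's relative frequency in the caption; IDF penalizes
    n-grams that occur in many images' reference sets.
    """
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    return {
        g: (c / total) * math.log(num_docs / max(doc_freq.get(g, 1), 1))
        for g, c in counts.items()
    }

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, references, all_reference_sets, n=1):
    """CIDEr-style score for one n-gram size.

    Document frequency counts in how many images' reference sets each
    n-gram appears, so dataset-wide common n-grams get near-zero weight.
    """
    doc_freq = Counter()
    for refs in all_reference_sets:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        doc_freq.update(seen)
    num_docs = len(all_reference_sets)
    cand_vec = tfidf_vector(candidate.split(), n, doc_freq, num_docs)
    scores = [
        cosine(cand_vec, tfidf_vector(ref.split(), n, doc_freq, num_docs))
        for ref in references
    ]
    return sum(scores) / len(scores)
```

Note how the IDF term does the work: a unigram like "a" that appears in every image's references gets weight log(N/N) = 0, so only the image-specific words contribute to the score.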
Related Pages
Implementation:Mlfoundations_Open_flamingo_Evaluate_captioning