Implementation: mlfoundations/open_flamingo captioning evaluation (evaluate_captioning)
Overview
A concrete tool from the OpenFlamingo evaluation module for running few-shot captioning evaluation with CIDEr scoring on the COCO and Flickr30K benchmarks.
Description
The evaluate_captioning() function: (1) loads train and test splits of CaptionDataset, (2) selects few-shot examples (random or RICES), (3) constructs prompts with <image>Output:{caption}<|endofchunk|> format, (4) generates captions via eval_model.get_outputs() with beam search, (5) gathers predictions across distributed ranks, (6) computes CIDEr score via compute_cider() using pycocoevalcap. The companion compute_cider() function wraps pycocoevalcap's COCOEvalCap.
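The prompt format in step (3) can be sketched as follows. The helper name below is hypothetical (the real code builds the prompt string inline), but the `<image>Output:{caption}<|endofchunk|>` template matches the description above:

```python
def build_caption_prompt(shot_captions):
    """Build an OpenFlamingo-style few-shot captioning prompt.

    Each in-context example is rendered as
    <image>Output:{caption}<|endofchunk|>, and the final query image
    ends with a bare "Output:" for the model to complete.
    (Hypothetical helper for illustration only.)
    """
    context = "".join(
        f"<image>Output:{caption.strip()}<|endofchunk|>"
        for caption in shot_captions
    )
    # Append the query image with an open-ended "Output:" cue.
    return context + "<image>Output:"

# Example: a 2-shot prompt.
prompt = build_caption_prompt(["A dog runs.", "A cat sleeps."])
```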
Usage
Called from the main evaluation loop for COCO and Flickr30K captioning benchmarks.
Code Reference
Source: Repository https://github.com/mlfoundations/open_flamingo, File: open_flamingo/eval/evaluate.py Lines L728-896 (evaluate_captioning), open_flamingo/eval/coco_metric.py Lines L1-22 (compute_cider)
Signature:
def evaluate_captioning(
    args: argparse.Namespace,
    eval_model: BaseEvalModel,
    seed: int = 42,
    min_generation_length: int = 0,
    max_generation_length: int = 20,
    num_beams: int = 3,
    length_penalty: float = 0.0,
    num_shots: int = 8,
    dataset_name: str = "coco",
    cached_features=None,
) -> float:
    """Returns CIDEr score * 100"""

def compute_cider(result_path: str, annotations_path: str) -> Dict[str, float]:
    """Returns dict with CIDEr, BLEU, METEOR, ROUGE_L, SPICE scores"""
Import:
from open_flamingo.eval.evaluate import evaluate_captioning
from open_flamingo.eval.coco_metric import compute_cider
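compute_cider() expects result_path to point at a JSON file in the standard COCO caption results format, a list of `{"image_id", "caption"}` records that pycocotools' COCO.loadRes (used under the hood by COCOEvalCap) can consume. A minimal sketch of writing such a file; the helper name is an assumption, not part of the library:

```python
import json

def write_coco_results(predictions, result_path):
    """Serialize predictions to the COCO caption results format:
    a JSON list of {"image_id": int, "caption": str} entries.
    (Hypothetical helper sketching the expected file layout.)"""
    records = [
        {"image_id": int(image_id), "caption": caption}
        for image_id, caption in predictions.items()
    ]
    with open(result_path, "w") as f:
        json.dump(records, f)

# One predicted caption keyed by COCO image id.
write_coco_results({42: "a dog on a beach"}, "results.json")
```

The resulting file is what you would pass as result_path to compute_cider(), alongside the dataset's annotations JSON as annotations_path.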
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Namespace | Yes | Eval config with dataset paths |
| eval_model | BaseEvalModel | Yes | Model wrapper |
| seed | int | No | Random seed (default 42) |
| num_shots | int | No | Few-shot examples (default 8) |
| dataset_name | str | No | "coco" or "flickr" (default "coco") |
| num_beams | int | No | Beam width (default 3) |
| max_generation_length | int | No | Max caption tokens (default 20) |
| cached_features | Tensor | No | RICES cached features |
Outputs
| Type | Description |
|---|---|
| float | CIDEr score multiplied by 100 |
Usage Examples
# Run 8-shot captioning evaluation on COCO
cider_score = evaluate_captioning(
    args=args,
    eval_model=eval_model,
    seed=42,
    num_shots=8,
    dataset_name="coco",
    num_beams=3,
    max_generation_length=20,
)
print(f"COCO CIDEr score: {cider_score:.2f}")
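In distributed runs, step (5) concatenates per-rank prediction lists, which can contain duplicate image ids when the distributed sampler pads the last batch. A hedged sketch of the merge step (the gather itself would use torch.distributed and is omitted here; the helper name is hypothetical):

```python
def merge_rank_predictions(per_rank_predictions):
    """Merge prediction lists gathered from all ranks, keeping the
    first caption seen for each image_id. Duplicates can arise from
    distributed-sampler padding of the final batch.
    (Sketch only; not the exact code from evaluate.py.)"""
    merged = {}
    for rank_preds in per_rank_predictions:
        for pred in rank_preds:
            merged.setdefault(pred["image_id"], pred["caption"])
    return [{"image_id": i, "caption": c} for i, c in merged.items()]

rank0 = [{"image_id": 1, "caption": "a red bus"}]
rank1 = [{"image_id": 2, "caption": "two cats"},
         {"image_id": 1, "caption": "a red bus"}]  # padded duplicate
merged = merge_rank_predictions([rank0, rank1])
```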