Principle:Mlfoundations Open flamingo Classification Evaluation
Overview
Evaluation methodology that measures classification accuracy by scoring all candidate class names using language model log-probabilities conditioned on visual context.
Description
Classification evaluation uses the vision-language model as a discriminative classifier by computing the log-probability of each class name given the visual context and few-shot demonstrations. For each test image, the model scores every class name (1000 for ImageNet, 2 for Hateful Memes) by computing the sum of log-probabilities of the class name tokens. The class with the highest log-probability is selected as the prediction. KV-cache optimization avoids re-encoding the shared prompt prefix for each class. Prompt ensembling averages scores across permutations of in-context examples.
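The scoring loop described above can be sketched as follows. This is a minimal illustration, not the repository's implementation: `token_logprob_fn` is a hypothetical callable standing in for the model, assumed to return log P(token | prefix) already conditioned on the image and few-shot demonstrations.

```python
import math

def score_classes(class_token_ids, token_logprob_fn):
    """Pick the class whose token sequence has the highest summed log-prob.

    class_token_ids: dict mapping class name -> list of token ids.
    token_logprob_fn(prefix, token): hypothetical model hook returning
        log P(token | prefix, image, demos).
    """
    scores = {}
    for name, ids in class_token_ids.items():
        total, prefix = 0.0, []
        for tok in ids:
            total += token_logprob_fn(prefix, tok)  # sum per-token log-probs
            prefix = prefix + [tok]
        scores[name] = total
    # Highest total log-probability wins
    return max(scores, key=scores.get), scores

# Toy demo with a uniform per-token log-prob: the shorter name wins,
# which is exactly the length bias that normalization (below) addresses.
pred, scores = score_classes(
    {"cat": [1], "dog house": [2, 3]},
    lambda prefix, tok: math.log(0.5),
)
```

With real models the per-token log-probabilities come from a softmax over the vocabulary at each position; only the candidate class tokens need to be scored, not generated.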
Usage
When evaluating on classification benchmarks (ImageNet, Hateful Memes) where the model must select from predefined categories.
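The Description notes that prompt ensembling averages scores across permutations of the in-context examples. A minimal sketch of that averaging step, where `score_fn` is a hypothetical callable returning a class's log-prob score under one demo ordering:

```python
import itertools
import statistics

def ensemble_scores(demos, classes, score_fn, max_perms=4):
    """Average per-class scores over orderings of the in-context demos.

    demos: list of in-context examples.
    score_fn(demos, cls): hypothetical hook returning the log-prob score
        of `cls` given that demo ordering.
    max_perms: cap on orderings, since len(demos)! grows quickly.
    """
    perms = list(itertools.permutations(demos))[:max_perms]
    return {
        c: statistics.mean(score_fn(list(p), c) for p in perms)
        for c in classes
    }
```

Averaging over orderings reduces the sensitivity of few-shot prompts to the arbitrary order in which demonstrations appear.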
Theoretical Basis
Language model classification works by treating class names as possible completions:
- P(class | image, demos) ∝ ∏_i P(token_i | token_{<i}, image, demos)

Scoring a fixed candidate set by log-probability avoids the parsing and formatting pitfalls of open-ended generation. Length normalization divides the total log-probability by the token count to prevent a bias toward shorter class names, since summed log-probabilities are always more negative for longer sequences. KV-cache optimization stores the key-value pairs from the shared prompt prefix and reuses them for every class completion, reducing computation from O(C · (L + K)) to O(L + C · K), where C is the number of classes, L the prompt length, and K the class-name length.
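The complexity claim above can be made concrete by counting token forward passes with and without prefix caching. This is an illustrative model only; the ImageNet-scale numbers below (C = 1000 classes, and L = 512, K = 3 as assumed example lengths) are not taken from the codebase.

```python
def passes_without_cache(C, L, K):
    # Every class re-encodes the full shared prompt plus its own name:
    # C classes x (L prompt tokens + K class-name tokens).
    return C * (L + K)

def passes_with_cache(C, L, K):
    # The shared prompt is encoded once; its cached key-value pairs are
    # reused, so each class only pays for its K name tokens.
    return L + C * K

# ImageNet-scale illustration (assumed L=512, K=3)
saved = passes_without_cache(1000, 512, 3) - passes_with_cache(1000, 512, 3)
```

Here the cached variant processes 3,512 tokens instead of 515,000, a roughly 150x reduction for the assumed lengths.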
Related Pages
Implementation:Mlfoundations_Open_flamingo_Evaluate_classification