Principle: OpenAI CLIP Top-K Accuracy Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Classification, Vision |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
An evaluation protocol that measures classification performance by computing the percentage of test samples whose true label appears among the model's top-K highest-scoring predictions.
Description
Top-K accuracy is the standard metric for assessing image classification systems, particularly on large-scale benchmarks such as ImageNet. For each test image, the model ranks all classes by predicted score. Top-1 accuracy checks whether the highest-scoring class matches the true label; top-5 accuracy checks whether the true label appears anywhere in the five highest-scoring predictions.
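As a concrete illustration of the two metrics (a toy example, not from the source): if the true label ranks third by score, the prediction counts as a top-5 hit but not a top-1 hit.

```python
import torch

# Scores for one image over 6 classes; the true label is class 4
scores = torch.tensor([0.9, 0.7, 0.8, 0.1, 0.75, 0.2])
true_label = 4

top5 = scores.topk(5).indices.tolist()  # class indices, highest score first
top1_hit = top5[0] == true_label        # top-1: must be the single best guess
top5_hit = true_label in top5           # top-5: anywhere in the first five

print(top5, top1_hit, top5_hit)  # [0, 2, 4, 1, 5] False True
```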
In the CLIP prompt-engineering workflow, the evaluation pipeline consists of:
- Logit computation: Multiply L2-normalized image features by the zero-shot classifier weight matrix with a temperature scalar (100.0) to produce per-class logits.
- Top-K extraction: Use torch.topk() to find the K highest-scoring class indices for each image.
- Correctness check: Compare the top-K predictions against the ground truth labels.
- Accuracy aggregation: Average the correctness across all test images, computing running means per batch.
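The four pipeline steps above can be sketched end-to-end. This is a minimal sketch, assuming a `loader` that yields batches of already L2-normalized image features with their labels, and a precomputed `zeroshot_weights` matrix (both names are illustrative, not from the source):

```python
import torch

def evaluate(loader, zeroshot_weights, ks=(1, 5)):
    hits = {k: 0 for k in ks}
    n = 0
    for image_features, targets in loader:
        # 1. Logit computation with the 100.0 temperature scalar
        logits = 100.0 * image_features @ zeroshot_weights          # [B, C]
        # 2. Top-K extraction for the largest K needed
        _, pred = logits.topk(max(ks), dim=1, largest=True, sorted=True)
        # 3. Correctness check against the ground-truth labels
        correct = pred.eq(targets.unsqueeze(1))                     # [B, maxk]
        # 4. Accumulate hit counts so the mean runs over all batches
        for k in ks:
            hits[k] += correct[:, :k].any(dim=1).sum().item()
        n += targets.size(0)
    return {k: 100.0 * hits[k] / n for k in ks}

# Toy run: identity features over 6 classes, split into two batches
feats, targets = torch.eye(6), torch.arange(6)
loader = [(feats[:3], targets[:3]), (feats[3:], targets[3:])]
print(evaluate(loader, torch.eye(6)))  # {1: 100.0, 5: 100.0}
```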
The temperature scalar of 100.0 applied to the cosine-similarity logits is standard in CLIP evaluation and corresponds to the exponentiated learned logit_scale from training (capped at 100 in the original implementation).
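One consequence worth noting (a small check, not a claim from the source): multiplying by a positive constant is monotonic, so the 100.0 factor does not change the top-K ranking, and therefore does not change top-K accuracy; it only matters once logits are converted to softmax probabilities.

```python
import torch

sims = torch.tensor([[0.21, 0.25, 0.19, 0.24]])  # raw cosine similarities

# Same ranking with and without the temperature scalar
assert torch.equal(sims.topk(2).indices, (100.0 * sims).topk(2).indices)

# But very different softmax confidences
p_raw = sims.softmax(dim=-1)
p_scaled = (100.0 * sims).softmax(dim=-1)
print(p_raw.max().item(), p_scaled.max().item())
```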
Usage
Use this principle when benchmarking CLIP zero-shot classification performance on standard datasets. Report both top-1 and top-5 accuracy for comparison with published results.
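For context, the zero-shot classifier weight matrix used in this benchmark is built by embedding prompt templates for each class name, averaging over templates, and renormalizing. A minimal sketch of that construction, with a random stand-in `encode_text` function in place of CLIP's real text encoder (the stand-in and the 512-dim size are illustrative assumptions):

```python
import torch

def encode_text(prompts):
    # Hypothetical stand-in for CLIP's text encoder; a real run would use
    # model.encode_text(clip.tokenize(prompts)) instead.
    return torch.randn(len(prompts), 512)

def build_zeroshot_weights(classnames, templates):
    cols = []
    for name in classnames:
        emb = encode_text([t.format(name) for t in templates])  # [T, D]
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize each prompt
        mean = emb.mean(dim=0)                      # average over templates
        cols.append(mean / mean.norm())             # renormalize the ensemble
    return torch.stack(cols, dim=1)                 # [D, num_classes]

weights = build_zeroshot_weights(
    ["cat", "dog"], ["a photo of a {}.", "a sketch of a {}."])
print(weights.shape)  # torch.Size([512, 2])
```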
Theoretical Basis
Top-K accuracy measures classification quality with increasing leniency:
```python
import torch

# Top-K accuracy computation
# logits: [B, num_classes] = 100.0 * image_features @ zeroshot_weights
# target: [B] = ground-truth class indices

# 1. Find the top-K predictions (K = largest value needed, here 5)
_, pred = logits.topk(5, dim=1, largest=True, sorted=True)
# pred: [B, 5]

# 2. Check if the true label is in the top-K
pred = pred.t()  # [K, B]
correct = pred.eq(target.view(1, -1).expand_as(pred))  # [K, B] boolean

# 3. Compute accuracy for each K value
batch_size = target.size(0)
for k in [1, 5]:
    correct_k = correct[:k].reshape(-1).float().sum()
    accuracy_k = correct_k / batch_size * 100.0
```
Expected results (CLIP paper, ViT-B/32 on ImageNetV2 with 80-template ensemble):
- Top-1: ~55.93%
- Top-5: ~83.36%