Principle: OpenAI CLIP Top-K Accuracy Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Classification, Vision |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
An evaluation protocol that measures classification performance by computing the percentage of test samples whose true label appears among the model's top-K highest-scoring predictions.
Description
Top-K accuracy is the standard metric for assessing image classification systems, particularly on large-scale benchmarks such as ImageNet. For each test image, the model ranks all classes by predicted score. Top-1 accuracy checks whether the highest-scoring class matches the true label; top-5 accuracy checks whether the true label appears anywhere in the five highest-scoring predictions.
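As a concrete illustration of the two metrics (a toy example, not from the source): if the true label ranks third by score, the prediction counts as a top-5 hit but not a top-1 hit.

```python
import torch

# Scores for one image over 6 classes; the true label is class 4
scores = torch.tensor([0.9, 0.7, 0.8, 0.1, 0.75, 0.2])
true_label = 4

top5 = scores.topk(5).indices.tolist()  # class indices, highest score first
top1_hit = top5[0] == true_label        # top-1: must be the single best guess
top5_hit = true_label in top5           # top-5: anywhere in the first five

print(top5, top1_hit, top5_hit)  # [0, 2, 4, 1, 5] False True
```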
In the CLIP prompt-engineering workflow, the evaluation pipeline consists of:
- Logit computation: Multiply L2-normalized image features by the zero-shot classifier weight matrix with a temperature scalar (100.0) to produce per-class logits.
- Top-K extraction: Use torch.topk() to find the K highest-scoring class indices for each image.
- Correctness check: Compare the top-K predictions against the ground truth labels.
- Accuracy aggregation: Average the correctness across all test images, computing running means per batch.
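The four pipeline steps above can be sketched end-to-end. This is a minimal sketch, assuming a `loader` that yields batches of already L2-normalized image features with their labels, and a precomputed `zeroshot_weights` matrix (both names are illustrative, not from the source):

```python
import torch

def evaluate(loader, zeroshot_weights, ks=(1, 5)):
    hits = {k: 0 for k in ks}
    n = 0
    for image_features, targets in loader:
        # 1. Logit computation with the 100.0 temperature scalar
        logits = 100.0 * image_features @ zeroshot_weights          # [B, C]
        # 2. Top-K extraction for the largest K needed
        _, pred = logits.topk(max(ks), dim=1, largest=True, sorted=True)
        # 3. Correctness check against the ground-truth labels
        correct = pred.eq(targets.unsqueeze(1))                     # [B, maxk]
        # 4. Accumulate hit counts so the mean runs over all batches
        for k in ks:
            hits[k] += correct[:, :k].any(dim=1).sum().item()
        n += targets.size(0)
    return {k: 100.0 * hits[k] / n for k in ks}

# Toy run: identity features over 6 classes, split into two batches
feats, targets = torch.eye(6), torch.arange(6)
loader = [(feats[:3], targets[:3]), (feats[3:], targets[3:])]
print(evaluate(loader, torch.eye(6)))  # {1: 100.0, 5: 100.0}
```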
The temperature scalar of 100.0 applied to the cosine-similarity logits is standard in CLIP evaluation and corresponds to the exponentiated learned logit_scale from training (capped at 100 in the original implementation).
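One consequence worth noting (a small check, not a claim from the source): multiplying by a positive constant is monotonic, so the 100.0 factor does not change the top-K ranking, and therefore does not change top-K accuracy; it only matters once logits are converted to softmax probabilities.

```python
import torch

sims = torch.tensor([[0.21, 0.25, 0.19, 0.24]])  # raw cosine similarities

# Same ranking with and without the temperature scalar
assert torch.equal(sims.topk(2).indices, (100.0 * sims).topk(2).indices)

# But very different softmax confidences
p_raw = sims.softmax(dim=-1)
p_scaled = (100.0 * sims).softmax(dim=-1)
print(p_raw.max().item(), p_scaled.max().item())
```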
Usage
Use this principle when benchmarking CLIP zero-shot classification performance on standard datasets. Report both top-1 and top-5 accuracy for comparison with published results.
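For context, the zero-shot classifier weight matrix used in this benchmark is built by embedding prompt templates for each class name, averaging over templates, and renormalizing. A minimal sketch of that construction, with a random stand-in `encode_text` function in place of CLIP's real text encoder (the stand-in and the 512-dim size are illustrative assumptions):

```python
import torch

def encode_text(prompts):
    # Hypothetical stand-in for CLIP's text encoder; a real run would use
    # model.encode_text(clip.tokenize(prompts)) instead.
    return torch.randn(len(prompts), 512)

def build_zeroshot_weights(classnames, templates):
    cols = []
    for name in classnames:
        emb = encode_text([t.format(name) for t in templates])  # [T, D]
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize each prompt
        mean = emb.mean(dim=0)                      # average over templates
        cols.append(mean / mean.norm())             # renormalize the ensemble
    return torch.stack(cols, dim=1)                 # [D, num_classes]

weights = build_zeroshot_weights(
    ["cat", "dog"], ["a photo of a {}.", "a sketch of a {}."])
print(weights.shape)  # torch.Size([512, 2])
```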
Theoretical Basis
Top-K accuracy measures classification quality with increasing leniency:
```python
import torch

# Top-K accuracy computation
# logits: [B, num_classes] = 100.0 * image_features @ zeroshot_weights
# target: [B] = ground-truth class indices

# 1. Find the top-K predictions (K = largest value needed, here 5)
_, pred = logits.topk(5, dim=1, largest=True, sorted=True)
# pred: [B, 5]

# 2. Check if the true label is in the top-K
pred = pred.t()  # [K, B]
correct = pred.eq(target.view(1, -1).expand_as(pred))  # [K, B] boolean

# 3. Compute accuracy for each K value
batch_size = target.size(0)
for k in [1, 5]:
    correct_k = correct[:k].reshape(-1).float().sum()
    accuracy_k = correct_k / batch_size * 100.0
```
Expected results (CLIP paper, ViT-B/32 on ImageNetV2 with 80-template ensemble):
- Top-1: ~55.93%
- Top-5: ~83.36%