# Heuristic: OpenAI CLIP Template Ensemble for Zero-Shot Classification
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Computer_Vision, NLP |
| Last Updated | 2026-02-13 22:00 GMT |
## Overview
Ensembling 80 diverse text templates per class improves CLIP zero-shot accuracy by 3-5% over a single template; a subset of 7 optimally-selected templates performs even better.
## Description
Instead of using a single prompt like "a photo of a {class}", CLIP's prompt engineering approach generates multiple text descriptions per class using 80 diverse templates (e.g., "a bad photo of a {}", "a sculpture of a {}", "a {} in a video game."). Each template-class pair is encoded, L2-normalized, and then averaged to produce a single class embedding. This ensembling reduces the noise of any single template and captures multiple visual perspectives of each class. Sequential forward selection over the 80 templates found that a subset of just 7 templates outperforms the full ensemble, especially on smaller models.
## Usage
Apply this heuristic when performing zero-shot classification with CLIP to maximize accuracy. Use the full 80-template ensemble for best general results, or use the optimized 7-template subset for faster computation with equal or better accuracy.
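Using the 7-template subset is a drop-in replacement for the full 80-template list: only the template list changes, not the construction code. A minimal sketch (the `expand` helper name is illustrative, not from the notebook):

```python
# The 7-template subset reported in the notebook, in selection order.
SELECTED_TEMPLATES = [
    "itap of a {}.",
    "a bad photo of the {}.",
    "a origami {}.",
    "a photo of the large {}.",
    "a {} in a video game.",
    "art of the {}.",
    "a photo of the small {}.",
]

def expand(classname, templates=SELECTED_TEMPLATES):
    """Fill each template with the class name, yielding the texts to encode."""
    return [t.format(classname) for t in templates]

print(expand("goldfish")[0])  # itap of a goldfish.
```

The resulting text list is then encoded and averaged exactly as with the full ensemble.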
## The Insight (Rule of Thumb)
- Action: For each class, generate text from multiple templates, encode all, normalize, average the embeddings, then re-normalize the averaged vector.
- Value: 80 templates yield +3-5% top-1 accuracy over single template; 7 selected templates yield even better results.
- Trade-off: 80x more text encoding per class (one-time cost at classifier construction); negligible impact on inference speed since the classifier weights are precomputed.
- Optimal 7 templates (in selection order):
- `itap of a {}.`
- `a bad photo of the {}.`
- `a origami {}.`
- `a photo of the large {}.`
- `a {} in a video game.`
- `art of the {}.`
- `a photo of the small {}.`
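The normalize → average → re-normalize step above can be sketched with NumPy (shapes are illustrative; CLIP ViT-B/32 uses a 512-dimensional embedding):

```python
import numpy as np

def ensemble_embedding(template_embeddings):
    """Average per-template embeddings on the unit sphere.

    template_embeddings: (n_templates, dim) raw text embeddings.
    Returns a single (dim,) unit-norm class embedding.
    """
    # L2-normalize each template embedding so all lie on the unit sphere.
    normed = template_embeddings / np.linalg.norm(
        template_embeddings, axis=-1, keepdims=True
    )
    # Average, then re-normalize: the mean of unit vectors has norm < 1.
    mean = normed.mean(axis=0)
    return mean / np.linalg.norm(mean)

rng = np.random.default_rng(0)
emb = ensemble_embedding(rng.normal(size=(7, 512)))
print(np.linalg.norm(emb))  # ~1.0
```

Re-normalizing at the end matters because cosine-similarity classification assumes unit-norm class vectors.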
## Reasoning
CLIP was trained on noisy internet data with diverse image-text pairings. A single template like "a photo of a {class}" only captures one perspective. The 80 templates include diverse visual contexts: bad photos, sculptures, drawings, video games, origami, tattoos, different sizes, and different lighting conditions. Averaging these embeddings on the unit hypersphere finds a centroid that is robust to the visual diversity of real-world images. The optimal 7-template subset notably includes different scales (large, small), difficulty (bad photo), and abstract renditions (origami, video game, art), suggesting CLIP benefits from diverse conceptual coverage.
## Code Evidence
Template ensemble construction from notebook cell 15:
```python
def zeroshot_classifier(classnames, templates):
    # Assumes `model` (a loaded CLIP model), `clip`, and `tqdm`
    # are already in scope, as in the notebook.
    with torch.no_grad():
        zeroshot_weights = []
        for classname in tqdm(classnames):
            # Fill every template with the class name and encode all texts.
            texts = [template.format(classname) for template in templates]
            texts = clip.tokenize(texts).cuda()
            class_embeddings = model.encode_text(texts)
            # Normalize each embedding, average, then re-normalize the mean.
            class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
            class_embedding = class_embeddings.mean(dim=0)
            class_embedding /= class_embedding.norm()
            zeroshot_weights.append(class_embedding)
        zeroshot_weights = torch.stack(zeroshot_weights, dim=1).cuda()
        return zeroshot_weights
```
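At inference time the precomputed weights reduce classification to one matrix multiply of image features against the class-embedding matrix, which is why the ensemble's cost is one-time. A self-contained NumPy stand-in for that step (the notebook itself does this in PyTorch; `classify` and the 100.0 logit scale's placement here are illustrative):

```python
import numpy as np

def classify(image_features, zeroshot_weights, scale=100.0):
    """Zero-shot prediction: scaled cosine similarity against class embeddings.

    image_features:   (batch, dim) image embeddings.
    zeroshot_weights: (dim, n_classes) stacked unit-norm class embeddings.
    """
    # Normalize image features, then score every class in one matmul.
    feats = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    logits = scale * feats @ zeroshot_weights
    return logits.argmax(axis=-1)

# Toy check: an image embedding aligned with class 2's vector predicts class 2.
rng = np.random.default_rng(1)
weights = rng.normal(size=(16, 5))
weights /= np.linalg.norm(weights, axis=0, keepdims=True)
img = weights[:, 2][None, :]  # exactly class 2's direction
print(classify(img, weights))  # [2]
```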
Template count from notebook cell 10:
```python
print(f"{len(imagenet_classes)} classes, {len(imagenet_templates)} templates")
# Output: 1000 classes, 80 templates
```
Commentary on template selection from notebook cell 11 (markdown):

> "After the 80 templates were 'locked' for the paper, we ran sequential forward selection over the list of 80 templates. The search terminated after ensembling 7 templates."