Principle: OpenAI CLIP Zero-Shot Classifier Construction
| Knowledge Sources | Details |
|---|---|
| Domains | Vision, NLP, Zero_Shot_Learning, Classification |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A classifier weight construction mechanism that converts natural language class descriptions into a text-embedding-based weight matrix for zero-shot image classification through prompt template ensembling.
Description
Zero-Shot Classifier Construction is the process of building a classification weight matrix entirely from text, without any training images. For each class, multiple text descriptions are generated from prompt templates, encoded by the CLIP text encoder, L2-normalized, averaged across templates, and L2-normalized again. The resulting vectors are stacked column-wise to form a weight matrix that can classify images via a simple dot product.
The process for each class is:
- Template expansion: For each class name, generate N text strings using N prompt templates (e.g., "a photo of a {class}", "a sculpture of a {class}", etc.).
- Text encoding: Tokenize and encode all N texts through the CLIP text encoder to get N embedding vectors.
- Per-template normalization: L2-normalize each embedding individually.
- Averaging: Compute the mean of all N normalized embeddings for this class.
- Final normalization: L2-normalize the averaged embedding to get the class prototype.
- Stacking: Stack all class prototypes column-wise to form the classifier weight matrix of shape [embed_dim, num_classes].
At inference time, classification is performed as: logits = image_features @ zeroshot_weights, where image_features are L2-normalized image embeddings.
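The construction and inference steps above can be sketched end-to-end in NumPy. Random vectors stand in for the text encoder here (the `encode_text` stub and all dimensions are illustrative, not part of CLIP):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_templates, n_classes, batch = 8, 4, 3, 2

# Stand-in for the CLIP text encoder: one random embedding per template text.
def encode_text(n_texts):
    return rng.normal(size=(n_texts, embed_dim))

# Normalize -> average -> normalize: one prototype per class
prototypes = []
for _ in range(n_classes):
    emb = encode_text(n_templates)                      # [N_templates, embed_dim]
    emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # per-template L2 norm
    proto = emb.mean(axis=0)                            # centroid across templates
    proto /= np.linalg.norm(proto)                      # back onto the unit sphere
    prototypes.append(proto)

zeroshot_weights = np.stack(prototypes, axis=1)         # [embed_dim, n_classes]

# Inference: dot product with L2-normalized image features = cosine similarity
image_features = rng.normal(size=(batch, embed_dim))
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
logits = image_features @ zeroshot_weights              # [batch, n_classes]
preds = logits.argmax(axis=1)
```

Because both the image features and the class prototypes are unit vectors, every logit is a cosine similarity bounded in [-1, 1].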
Usage
Use this principle when building a prompt-engineered zero-shot classifier for any image classification dataset. This replaces the simple single-prompt approach with a more robust multi-template ensemble; the CLIP paper reports that ensembling 80 prompts improves ImageNet zero-shot accuracy by roughly 3.5 percentage points over the single default prompt (almost 5 points when combined with prompt engineering).
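For illustration, a template set might look like the following. This subset is hypothetical (the CLIP repository ships a much larger list, around 80 templates for ImageNet); any set of descriptive strings with a `{}` slot for the class name works:

```python
# Illustrative prompt templates; each "{}" is filled with the class name.
templates = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a sculpture of a {}.",
    "a photo of the large {}.",
    "a black and white photo of a {}.",
]

# Template expansion for one class
texts = [t.format("golden retriever") for t in templates]
```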
Theoretical Basis
The ensemble construction can be understood as estimating a robust class centroid in CLIP's embedding space:
# Classifier construction (PyTorch, using the OpenAI `clip` package:
# pip install git+https://github.com/openai/CLIP.git)
import torch
import clip

model, preprocess = clip.load("ViT-B/32")
classnames = ["golden retriever", "tabby cat"]          # example classes
templates = ["a photo of a {}.", "a sketch of a {}."]   # example templates

with torch.no_grad():
    zeroshot_weights = []
    for classname in classnames:
        # Generate template-expanded texts
        texts = [template.format(classname) for template in templates]
        # Encode all templates for this class
        text_embeddings = model.encode_text(clip.tokenize(texts))  # [N_templates, embed_dim]
        # Normalize each template embedding
        text_embeddings /= text_embeddings.norm(dim=-1, keepdim=True)
        # Average across templates (centroid estimation)
        class_embedding = text_embeddings.mean(dim=0)  # [embed_dim]
        # Normalize the centroid
        class_embedding /= class_embedding.norm()
        zeroshot_weights.append(class_embedding)
    # Stack: [embed_dim, N_classes]
    zeroshot_weights = torch.stack(zeroshot_weights, dim=1)

# Classification: logits = image_features @ zeroshot_weights
# logits shape: [B, N_classes]
The normalization-average-normalization pattern is important:
- Pre-average normalization ensures each template contributes equally regardless of embedding magnitude
- Post-average normalization ensures the final class prototype lies on the unit sphere, making the dot product equivalent to cosine similarity
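A tiny numeric check of the first point: without pre-average normalization, a single large-magnitude embedding dominates the centroid's direction. The two 2-D vectors below are illustrative, not real CLIP embeddings:

```python
import numpy as np

# Two "template embeddings" in different directions, one with much larger magnitude.
a = np.array([10.0, 0.0])   # dominant magnitude
b = np.array([0.0, 1.0])

# Naive average: the large-magnitude template dictates the direction.
naive = (a + b) / 2
naive /= np.linalg.norm(naive)

# Normalize-first average: each template contributes equally to the direction.
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)
fair = (an + bn) / 2
fair /= np.linalg.norm(fair)
```

The naive centroid points almost entirely along `a`, while the normalize-first centroid sits symmetrically between the two directions; both end on the unit sphere after the final normalization.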