Heuristic: OpenAI CLIP L2 Normalization for Cosine Similarity
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Computer_Vision, NLP |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Feature vectors from CLIP encoders must be L2-normalized before computing similarity; without normalization, dot products do not represent cosine similarity and classification quality degrades.
Description
CLIP's contrastive training objective operates on L2-normalized feature embeddings. At inference time, the `forward()` method normalizes both image and text features before computing the scaled dot product. However, `encode_image()` and `encode_text()` return unnormalized feature vectors. When using these methods independently (e.g., for zero-shot classification or retrieval), the caller must explicitly normalize the features. Forgetting this step is a common source of poor results.
Usage
Apply this heuristic every time you use `encode_image()` or `encode_text()` independently (outside of `model.forward()`). The `forward()` method handles normalization internally, but if you call the encoders directly, you must normalize yourself before computing similarities.
The Insight (Rule of Thumb)
- Action: After calling `encode_image()` or `encode_text()`, divide each feature vector by its L2 norm: `features /= features.norm(dim=-1, keepdim=True)`.
- Value: Converts raw dot products into cosine similarity (range [-1, 1]).
- Trade-off: Negligible compute cost; failing to normalize will produce meaningless similarity scores.
- Pattern: The temperature-scaled logit computation is `logit_scale * normalized_image_features @ normalized_text_features.T`.
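The rule above can be checked numerically. The sketch below uses made-up 2-D vectors (not real CLIP features) to show how a raw dot product conflates magnitude with direction, while the normalized dot product recovers cosine similarity:

```python
import numpy as np

# Hypothetical stand-ins for CLIP features: same direction, different magnitudes.
a = np.array([3.0, 4.0])   # L2 norm 5
b = np.array([0.6, 0.8])   # L2 norm 1, parallel to a

raw = a @ b                # raw dot product: scales with ||a||, here 5.0

# L2-normalize, mirroring `features /= features.norm(dim=-1, keepdim=True)`
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
cos = a_n @ b_n            # cosine similarity: 1.0 for parallel vectors
```

Because `a` and `b` point in the same direction, the only correct similarity is 1.0; the raw dot product of 5.0 is an artifact of `a`'s magnitude, which is exactly the error unnormalized CLIP features introduce.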
Reasoning
CLIP is trained with a contrastive loss on L2-normalized embeddings, meaning the model learns to place semantically similar items at small angular distances on the unit hypersphere. Without normalization, the magnitude of feature vectors varies across inputs, and dot products conflate magnitude with direction, producing unreliable similarity rankings. The `logit_scale` parameter (initialized to `ln(1/0.07) ≈ 2.66`, exponentiated to ~14.3) amplifies cosine similarities into sharper logit distributions for softmax classification.
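The effect of `logit_scale` can also be seen concretely. This sketch (illustrative cosine similarities, not real model outputs) shows that exponentiating the initial value `ln(1/0.07)` recovers ~14.3, and that multiplying by it sharpens the softmax distribution:

```python
import numpy as np

# CLIP initializes logit_scale to ln(1/0.07) and exponentiates it at forward time.
logit_scale = np.exp(np.log(1 / 0.07))   # ≈ 14.29

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical cosine similarities between one image and three text prompts.
cos_sims = np.array([0.30, 0.25, 0.10])

flat = softmax(cos_sims)                 # near-uniform: cosines live in [-1, 1]
sharp = softmax(logit_scale * cos_sims)  # scaled logits give a peaked distribution
```

Without the scale, softmax over values confined to [-1, 1] is nearly uniform; the temperature makes the top match dominate, which is what zero-shot classification relies on.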
Code Evidence
Normalization in CLIP.forward() from `clip/model.py:358-369`:
```python
def forward(self, image, text):
    image_features = self.encode_image(image)
    text_features = self.encode_text(text)

    # normalized features
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)

    # cosine similarity as logits
    logit_scale = self.logit_scale.exp()
    logits_per_image = logit_scale * image_features @ text_features.t()
```
Manual normalization in README zero-shot example from `README.md:115-117`:
```python
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
```
Normalization in zeroshot_classifier() from notebook cell 15:
```python
class_embeddings /= class_embeddings.norm(dim=-1, keepdim=True)
class_embedding = class_embeddings.mean(dim=0)
class_embedding /= class_embedding.norm()
```
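Note the second normalization in the snippet above: averaging several unit-norm prompt embeddings generally yields a vector with norm less than 1, so the mean must be renormalized. A minimal sketch with made-up unit vectors:

```python
import numpy as np

# Two hypothetical unit-norm prompt embeddings for the same class.
e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

mean = (e1 + e2) / 2                      # [0.5, 0.5]; norm is sqrt(0.5) ≈ 0.707
class_emb = mean / np.linalg.norm(mean)   # renormalize back onto the unit sphere
```

Skipping this renormalization would shrink the class embedding's magnitude by an amount that depends on how much the prompts disagree, reintroducing the magnitude/direction conflation the heuristic exists to avoid.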