Principle:OpenAI CLIP Contrastive Similarity Prediction

From Leeroopedia
Knowledge Sources
Domains Vision, NLP, Contrastive_Learning, Representation_Learning
Last Updated 2026-02-13 22:00 GMT

Overview

A prediction mechanism that computes scaled cosine similarity between image and text embeddings to produce classification logits, using a learned temperature parameter.

Description

Contrastive Similarity Prediction is the final step in CLIP's zero-shot classification pipeline. Given image feature vectors and text feature vectors in a shared embedding space, it computes the cosine similarity between every image-text pair, scales by a learned temperature parameter, and produces logits that can be converted to probabilities via softmax.

The mechanism consists of:

  1. L2 normalization: Both image and text features are normalized to unit length, ensuring that dot products equal cosine similarities.
  2. Scaled dot product: The cosine similarities are multiplied by a learned temperature parameter (logit_scale), which sharpens or softens the probability distribution.
  3. Symmetric logits: The output includes both logits_per_image (each image scored against all texts) and logits_per_text (each text scored against all images), which are transposes of each other.
  4. Softmax prediction: Applying softmax(dim=-1) to logits_per_image yields per-image class probabilities for zero-shot classification.

This mechanism was used during CLIP's contrastive pre-training with a symmetric cross-entropy loss over matched image-text pairs, and at inference time serves as a zero-shot classifier.
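The symmetric cross-entropy objective mentioned above can be sketched as follows. This is a schematic NumPy version, not CLIP's actual training code: random vectors stand in for encoder outputs, the batch size and embedding dimension are illustrative, and the temperature is fixed rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 8  # illustrative batch size and embedding dimension

# Random stand-ins for encoder outputs; matched pair i is row i in both
image_features = rng.normal(size=(B, D))
text_features = rng.normal(size=(B, D))

# L2-normalize so dot products are cosine similarities
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

logit_scale = 1 / 0.07  # learned during training; fixed at its init value here
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy with integer class targets
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(B)  # matched image-text pairs lie on the diagonal
loss = (cross_entropy(logits_per_image, targets)
        + cross_entropy(logits_per_text, targets)) / 2
```

Averaging the image-to-text and text-to-image cross-entropies is what makes the objective symmetric: both directions of matching are trained at once.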

Usage

Use this principle for zero-shot image classification by comparing a single image against multiple text descriptions (typically prompts built from class labels). The text description with the highest similarity score is the predicted class. The same similarity scores also support image-text retrieval and ranking tasks.
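The zero-shot usage above can be condensed into one helper. This is a minimal sketch with hand-built toy embeddings standing in for real encoder outputs; the function name `zero_shot_predict` and the toy prompt vectors are illustrative assumptions, not part of CLIP's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def zero_shot_predict(image_features, text_features, logit_scale=1 / 0.07):
    """Score each image against N class-prompt embeddings; return argmax class."""
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T
    probs = softmax(logits)
    return probs.argmax(axis=-1), probs

# Toy demo: 1 image embedding vs. 3 class-prompt embeddings. The image
# vector is a nudged copy of prompt 1, so class 1 should win.
txt = np.eye(3, 5)       # 3 orthogonal stand-in "prompt" embeddings
img = txt[1:2] + 0.1     # closest to prompt 1 by construction
pred, probs = zero_shot_predict(img, txt)
```

Each row of `probs` sums to 1, so the scores read directly as per-image class probabilities.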

Theoretical Basis

The contrastive similarity computation follows the InfoNCE framework:

# Core computation (PyTorch-style sketch)
# 1. Normalize features to unit length so dot products equal cosine similarities
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# 2. Compute scaled cosine similarity
logit_scale = learned_parameter.exp()  # initialized to exp(ln(1/0.07)) = 1/0.07 ≈ 14.29
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T

# 3. Convert to probabilities
probs = logits_per_image.softmax(dim=-1)
# probs[i, j] = probability that image i matches text j

The learned temperature (logit_scale) is critical:

  • Initialized to ln(1/0.07) ≈ 2.66, which corresponds to a temperature of 0.07
  • During training, this parameter is optimized alongside model weights
  • Higher logit_scale produces sharper (more confident) probability distributions
  • Lower logit_scale produces softer (more uncertain) distributions
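The effect of the temperature can be seen directly by applying softmax to the same cosine similarities at a low and a high scale. The similarity values below are illustrative, not taken from any model:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

cosine_sims = np.array([0.30, 0.25, 0.10])  # illustrative similarities

soft = softmax(1.0 * cosine_sims)    # low logit_scale: nearly uniform
sharp = softmax(100.0 * cosine_sims) # high logit_scale: near one-hot
```

With a scale of 1 the three probabilities stay close together, while a scale of 100 pushes almost all mass onto the top-scoring entry, which is why the learned scale directly controls prediction confidence.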

The symmetric structure means the same forward pass produces both:

  • logits_per_image: shape [B_img, B_txt] — for "which text best describes this image?"
  • logits_per_text: shape [B_txt, B_img] — for "which image best matches this text?"
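The transpose relationship between the two logit matrices can be checked with stand-in features; the unequal batch sizes below are illustrative and simply make the two shapes easy to tell apart:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.normal(size=(2, 4))  # 2 images, embedding dim 4
txt = rng.normal(size=(3, 4))  # 3 texts, same embedding space
img /= np.linalg.norm(img, axis=-1, keepdims=True)
txt /= np.linalg.norm(txt, axis=-1, keepdims=True)

logit_scale = 1 / 0.07
logits_per_image = logit_scale * img @ txt.T  # shape [2, 3]
logits_per_text = logit_scale * txt @ img.T   # shape [3, 2]
```

Because the two matrices are exact transposes, a single forward pass answers both retrieval directions at no extra cost.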

Related Pages

Implemented By

Uses Heuristic
