Principle:OpenAI CLIP Contrastive Similarity Prediction

From Leeroopedia
Knowledge Sources
Domains Vision, NLP, Contrastive_Learning, Representation_Learning
Last Updated 2026-02-13 22:00 GMT

Overview

A prediction mechanism that computes scaled cosine similarity between image and text embeddings to produce classification logits, using a learned temperature parameter.

Description

Contrastive Similarity Prediction is the final step in CLIP's zero-shot classification pipeline. Given image feature vectors and text feature vectors in a shared embedding space, it computes the cosine similarity between every image-text pair, scales by a learned temperature parameter, and produces logits that can be converted to probabilities via softmax.

The mechanism consists of:

  1. L2 normalization: Both image and text features are normalized to unit length, ensuring that dot products equal cosine similarities.
  2. Scaled dot product: The cosine similarities are multiplied by a learned temperature parameter (logit_scale), which sharpens or softens the probability distribution.
  3. Symmetric logits: The output includes both logits_per_image (each image scored against all texts) and logits_per_text (each text scored against all images), which are transposes of each other.
  4. Softmax prediction: Applying softmax(dim=-1) to logits_per_image yields per-image class probabilities for zero-shot classification.

This mechanism was used during CLIP's contrastive pre-training with a symmetric cross-entropy loss over matched image-text pairs, and at inference time serves as a zero-shot classifier.
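The symmetric cross-entropy objective mentioned above can be sketched as follows. This is a schematic NumPy version, not CLIP's actual training code: random vectors stand in for encoder outputs, the batch size and embedding dimension are illustrative, and the temperature is fixed rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 4, 8  # illustrative batch size and embedding dimension

# Random stand-ins for encoder outputs; matched pair i is row i in both
image_features = rng.normal(size=(B, D))
text_features = rng.normal(size=(B, D))

# L2-normalize so dot products are cosine similarities
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=-1, keepdims=True)

logit_scale = 1 / 0.07  # learned during training; fixed at its init value here
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T

def cross_entropy(logits, targets):
    # Row-wise softmax cross-entropy with integer class targets
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(B)  # matched image-text pairs lie on the diagonal
loss = (cross_entropy(logits_per_image, targets)
        + cross_entropy(logits_per_text, targets)) / 2
```

Averaging the image-to-text and text-to-image cross-entropies is what makes the objective symmetric: both directions of matching are trained at once.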

Usage

Use this principle for zero-shot image classification by comparing a single image against multiple text descriptions (typically prompts built from class labels). The text description with the highest similarity score is the predicted class. The same similarity scores also support image-text retrieval and ranking tasks.
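The zero-shot usage above can be condensed into one helper. This is a minimal sketch with hand-built toy embeddings standing in for real encoder outputs; the function name `zero_shot_predict` and the toy prompt vectors are illustrative assumptions, not part of CLIP's API.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def zero_shot_predict(image_features, text_features, logit_scale=1 / 0.07):
    """Score each image against N class-prompt embeddings; return argmax class."""
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = logit_scale * img @ txt.T
    probs = softmax(logits)
    return probs.argmax(axis=-1), probs

# Toy demo: 1 image embedding vs. 3 class-prompt embeddings. The image
# vector is a nudged copy of prompt 1, so class 1 should win.
txt = np.eye(3, 5)       # 3 orthogonal stand-in "prompt" embeddings
img = txt[1:2] + 0.1     # closest to prompt 1 by construction
pred, probs = zero_shot_predict(img, txt)
```

Each row of `probs` sums to 1, so the scores read directly as per-image class probabilities.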

Theoretical Basis

The contrastive similarity computation follows the InfoNCE framework:

# Core computation (PyTorch-style sketch)
# 1. Normalize features to unit length so dot products equal cosine similarities
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# 2. Compute scaled cosine similarity
logit_scale = learned_parameter.exp()  # initialized to exp(ln(1/0.07)) = 1/0.07 ≈ 14.29
logits_per_image = logit_scale * image_features @ text_features.T
logits_per_text = logits_per_image.T

# 3. Convert to probabilities
probs = logits_per_image.softmax(dim=-1)
# probs[i, j] = probability that image i matches text j

The learned temperature (logit_scale) is critical:

  • Initialized to ln(1/0.07) ≈ 2.66, which corresponds to a temperature of 0.07
  • During training, this parameter is optimized alongside model weights
  • Higher logit_scale produces sharper (more confident) probability distributions
  • Lower logit_scale produces softer (more uncertain) distributions
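The effect of the temperature can be seen directly by applying softmax to the same cosine similarities at a low and a high scale. The similarity values below are illustrative, not taken from any model:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

cosine_sims = np.array([0.30, 0.25, 0.10])  # illustrative similarities

soft = softmax(1.0 * cosine_sims)    # low logit_scale: nearly uniform
sharp = softmax(100.0 * cosine_sims) # high logit_scale: near one-hot
```

With a scale of 1 the three probabilities stay close together, while a scale of 100 pushes almost all mass onto the top-scoring entry, which is why the learned scale directly controls prediction confidence.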

The symmetric structure means the same forward pass produces both:

  • logits_per_image: shape [B_img, B_txt] — for "which text best describes this image?"
  • logits_per_text: shape [B_txt, B_img] — for "which image best matches this text?"
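The transpose relationship between the two logit matrices can be checked with stand-in features; the unequal batch sizes below are illustrative and simply make the two shapes easy to tell apart:

```python
import numpy as np

rng = np.random.default_rng(1)
img = rng.normal(size=(2, 4))  # 2 images, embedding dim 4
txt = rng.normal(size=(3, 4))  # 3 texts, same embedding space
img /= np.linalg.norm(img, axis=-1, keepdims=True)
txt /= np.linalg.norm(txt, axis=-1, keepdims=True)

logit_scale = 1 / 0.07
logits_per_image = logit_scale * img @ txt.T  # shape [2, 3]
logits_per_text = logit_scale * txt @ img.T   # shape [3, 2]
```

Because the two matrices are exact transposes, a single forward pass answers both retrieval directions at no extra cost.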

Related Pages

Implemented By

Uses Heuristic
