Principle: OpenAI CLIP Zero-Shot Classifier Construction
| Knowledge Sources | Details |
|---|---|
| Domains | Vision, NLP, Zero_Shot_Learning, Classification |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A classifier weight construction mechanism that converts natural language class descriptions into a text-embedding-based weight matrix for zero-shot image classification through prompt template ensembling.
Description
Zero-Shot Classifier Construction is the process of building a classification weight matrix entirely from text, without any training images. For each class, multiple text descriptions are generated from prompt templates, encoded by the CLIP text encoder, L2-normalized, averaged across templates, and L2-normalized again. The resulting vectors are stacked column-wise to form a weight matrix that can classify images via a simple dot product.
The process for each class is:
- Template expansion: For each class name, generate N text strings using N prompt templates (e.g., "a photo of a {class}", "a sculpture of a {class}", etc.).
- Text encoding: Tokenize and encode all N texts through the CLIP text encoder to get N embedding vectors.
- Per-template normalization: L2-normalize each embedding individually.
- Averaging: Compute the mean of all N normalized embeddings for this class.
- Final normalization: L2-normalize the averaged embedding to get the class prototype.
- Stacking: Stack all class prototypes column-wise to form the classifier weight matrix of shape [embed_dim, num_classes].
At inference time, classification is performed as: logits = image_features @ zeroshot_weights, where image_features are L2-normalized image embeddings.
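The construction and inference steps above can be sketched end-to-end in NumPy. Random vectors stand in for the text encoder here (the `encode_text` stub and all dimensions are illustrative, not part of CLIP):

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_templates, n_classes, batch = 8, 4, 3, 2

# Stand-in for the CLIP text encoder: one random embedding per template text.
def encode_text(n_texts):
    return rng.normal(size=(n_texts, embed_dim))

# Normalize -> average -> normalize: one prototype per class
prototypes = []
for _ in range(n_classes):
    emb = encode_text(n_templates)                      # [N_templates, embed_dim]
    emb /= np.linalg.norm(emb, axis=-1, keepdims=True)  # per-template L2 norm
    proto = emb.mean(axis=0)                            # centroid across templates
    proto /= np.linalg.norm(proto)                      # back onto the unit sphere
    prototypes.append(proto)

zeroshot_weights = np.stack(prototypes, axis=1)         # [embed_dim, n_classes]

# Inference: dot product with L2-normalized image features = cosine similarity
image_features = rng.normal(size=(batch, embed_dim))
image_features /= np.linalg.norm(image_features, axis=-1, keepdims=True)
logits = image_features @ zeroshot_weights              # [batch, n_classes]
preds = logits.argmax(axis=1)
```

Because both the image features and the class prototypes are unit vectors, every logit is a cosine similarity bounded in [-1, 1].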
Usage
Use this principle when building a prompt-engineered zero-shot classifier for any image classification dataset. This replaces the simple single-prompt approach with a more robust multi-template ensemble; the CLIP paper reports that ensembling 80 prompts improves ImageNet zero-shot accuracy by roughly 3.5 percentage points over the single default prompt (almost 5 points when combined with prompt engineering).
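For illustration, a template set might look like the following. This subset is hypothetical (the CLIP repository ships a much larger list, around 80 templates for ImageNet); any set of descriptive strings with a `{}` slot for the class name works:

```python
# Illustrative prompt templates; each "{}" is filled with the class name.
templates = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a sculpture of a {}.",
    "a photo of the large {}.",
    "a black and white photo of a {}.",
]

# Template expansion for one class
texts = [t.format("golden retriever") for t in templates]
```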
Theoretical Basis
The ensemble construction can be understood as estimating a robust class centroid in CLIP's embedding space:
# Classifier construction (PyTorch, using the OpenAI `clip` package:
# pip install git+https://github.com/openai/CLIP.git)
import torch
import clip

model, preprocess = clip.load("ViT-B/32")
classnames = ["golden retriever", "tabby cat"]          # example classes
templates = ["a photo of a {}.", "a sketch of a {}."]   # example templates

with torch.no_grad():
    zeroshot_weights = []
    for classname in classnames:
        # Generate template-expanded texts
        texts = [template.format(classname) for template in templates]
        # Encode all templates for this class
        text_embeddings = model.encode_text(clip.tokenize(texts))  # [N_templates, embed_dim]
        # Normalize each template embedding
        text_embeddings /= text_embeddings.norm(dim=-1, keepdim=True)
        # Average across templates (centroid estimation)
        class_embedding = text_embeddings.mean(dim=0)  # [embed_dim]
        # Normalize the centroid
        class_embedding /= class_embedding.norm()
        zeroshot_weights.append(class_embedding)
    # Stack: [embed_dim, N_classes]
    zeroshot_weights = torch.stack(zeroshot_weights, dim=1)

# Classification: logits = image_features @ zeroshot_weights
# logits shape: [B, N_classes]
The normalization-average-normalization pattern is important:
- Pre-average normalization ensures each template contributes equally regardless of embedding magnitude
- Post-average normalization ensures the final class prototype lies on the unit sphere, making the dot product equivalent to cosine similarity
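A tiny numeric check of the first point: without pre-average normalization, a single large-magnitude embedding dominates the centroid's direction. The two 2-D vectors below are illustrative, not real CLIP embeddings:

```python
import numpy as np

# Two "template embeddings" in different directions, one with much larger magnitude.
a = np.array([10.0, 0.0])   # dominant magnitude
b = np.array([0.0, 1.0])

# Naive average: the large-magnitude template dictates the direction.
naive = (a + b) / 2
naive /= np.linalg.norm(naive)

# Normalize-first average: each template contributes equally to the direction.
an = a / np.linalg.norm(a)
bn = b / np.linalg.norm(b)
fair = (an + bn) / 2
fair /= np.linalg.norm(fair)
```

The naive centroid points almost entirely along `a`, while the normalize-first centroid sits symmetrically between the two directions; both end on the unit sphere after the final normalization.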