
Principle:OpenAI CLIP Prompt Engineering

From Leeroopedia
Knowledge Sources
Domains NLP, Vision, Zero_Shot_Learning
Last Updated 2026-02-13 22:00 GMT

Overview

A text prompt design strategy that improves zero-shot classification accuracy by pairing curated class name lists with diverse natural language templates and ensembling the resulting text embeddings.

Description

Prompt Engineering for CLIP is the practice of crafting effective text descriptions that serve as class prototypes for zero-shot classification. Instead of using bare class names (e.g., "dog"), well-designed prompts provide contextual framing (e.g., "a photo of a dog") that better matches the distribution of text seen during CLIP's contrastive pre-training.

The technique consists of two components:

  1. Class name curation: Disambiguating class labels that may be ambiguous in isolation. For example, "crane" could be a bird or a machine, so the class name is modified to "crane bird" or "construction crane" to reduce confusion. The CLIP paper demonstrates this with ImageNet's 1000 classes.
  2. Template design: Creating multiple template strings with a '{}' placeholder for the class name. Each template provides different context (e.g., "a photo of a {}", "a bad photo of the {}", "a sculpture of a {}"). The CLIP paper uses 80 templates for ImageNet.
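As a minimal sketch of the two components together, filling each template's '{}' placeholder with each curated class name (the three class names and three templates below are a small illustrative subset, not the full ImageNet lists):

```python
# Small subset of curated class names, following the disambiguation
# pattern used for ImageNet (e.g. "crane" split into two entries).
classes = ["crane bird", "construction crane", "metal nail"]

templates = [
    "a photo of a {}.",
    "a bad photo of the {}.",
    "a sculpture of a {}.",
]

# One list of filled-in prompts per class.
prompts = {c: [t.format(c) for t in templates] for c in classes}

print(prompts["crane bird"][0])  # -> "a photo of a crane bird."
```

With the full lists, this expansion yields 1000 classes x 80 templates = 80,000 prompts, which is why the embeddings are precomputed once and cached rather than encoded per query.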

Using multiple templates and averaging the resulting text embeddings (prompt ensembling) consistently improves accuracy over single prompts across many benchmarks.

Usage

Use this principle when performing zero-shot classification with CLIP and accuracy matters. Simple single-prompt classification (e.g., "a photo of a {}") works but leaves performance on the table. Prompt engineering with template ensembling can improve accuracy by 3-5 percentage points on benchmarks like ImageNet.
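Once per-class text prototypes have been built, zero-shot classification itself reduces to a cosine-similarity lookup against them. A minimal sketch with toy random vectors standing in for real CLIP embeddings (the arrays and class names here are hypothetical, not actual CLIP outputs):

```python
import numpy as np

# Toy stand-ins for L2-normalized class prototypes from the text encoder.
rng = np.random.default_rng(0)
class_names = ["dog", "cat", "crane bird"]
prototypes = rng.standard_normal((3, 32))
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)

# Toy image embedding constructed to lie near the "cat" prototype.
image_embedding = prototypes[1] + 0.1 * rng.standard_normal(32)
image_embedding /= np.linalg.norm(image_embedding)

# For unit vectors, cosine similarity is just a dot product.
scores = prototypes @ image_embedding
print(class_names[int(np.argmax(scores))])  # -> "cat"
```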

Theoretical Basis

The theoretical basis for prompt engineering in CLIP rests on the polysemy problem and distributional coverage:

# Problem: Bare class names are ambiguous
# "crane" -> bird or machine?
# "nail" -> metal nail or fingernail?

# Solution 1: Disambiguate class names
imagenet_classes = [
    ...,
    "metal nail",           # not just "nail"
    "kite (bird of prey)",  # not just "kite"
    ...
]

# Solution 2: Use multiple templates for distributional coverage
imagenet_templates = [
    "a photo of a {}.",
    "a bad photo of the {}.",
    "a sculpture of a {}.",
    "itap of a {}.",          # "I took a picture of a"
    "a origami {}.",
    # ... 80 templates total
]

# Solution 3: Ensemble by averaging text embeddings
# For each class:
#   1. Generate all template-class combinations
#   2. Encode each combination with CLIP text encoder
#   3. L2-normalize each embedding
#   4. Average across templates
#   5. L2-normalize the averaged embedding
# Result: one robust text prototype per class

The ensembling step is equivalent to creating a more robust estimate of the class's location in CLIP's embedding space by sampling multiple views of how the class might be described in natural language.
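The five ensembling steps above can be sketched directly. The `encode_text_stub` function below is a hypothetical stand-in for CLIP's text encoder (a deterministic pseudo-embedding) so the example runs without model weights; in practice it would be replaced by the real encoder:

```python
import numpy as np

def encode_text_stub(prompt: str, dim: int = 8) -> np.ndarray:
    # Hypothetical stand-in for CLIP's text encoder: a deterministic
    # pseudo-embedding seeded by the prompt's byte values.
    rng = np.random.default_rng(sum(prompt.encode()))
    return rng.standard_normal(dim)

def class_prototype(class_name, templates, encode=encode_text_stub):
    # 1. Generate all template-class combinations.
    prompts = [t.format(class_name) for t in templates]
    # 2. Encode each prompt with the text encoder.
    embs = np.stack([encode(p) for p in prompts])
    # 3. L2-normalize each embedding.
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    # 4. Average across templates.
    proto = embs.mean(axis=0)
    # 5. L2-normalize the averaged embedding.
    return proto / np.linalg.norm(proto)

templates = ["a photo of a {}.", "a bad photo of the {}."]
proto = class_prototype("dog", templates)  # one prototype per class
```

Note that normalizing before averaging (step 3) matters: it gives every template equal weight regardless of the raw magnitude of its embedding.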

Related Pages

Implemented By

Uses Heuristic
