Workflow:Openai CLIP Zero shot image classification
| Knowledge Sources | |
|---|---|
| Domains | Computer_Vision, Zero_Shot_Learning, Multimodal |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
End-to-end process for classifying images into arbitrary text-described categories using CLIP without any task-specific training data.
Description
This workflow demonstrates the core capability of CLIP: zero-shot image classification. Given an image and a set of candidate text labels, the model encodes both modalities into a shared embedding space and selects the label whose text embedding is most similar to the image embedding. No labeled training examples are needed for the target classes, as CLIP leverages its contrastive pre-training on 400 million image-text pairs to generalize to new categories at inference time.
Goal: Produce ranked predictions for an image against arbitrary text-described categories.
Scope: From a raw image and a list of class names to ranked label probabilities.
Strategy: Uses the dual-encoder architecture to independently embed images and text, then computes cosine similarity scaled by a learned temperature parameter to produce classification logits.
Usage
Execute this workflow when you have one or more images and a set of candidate class descriptions, and you need to classify the images without any labeled training data for those classes. This is appropriate for rapid prototyping, novel category recognition, or any scenario where collecting labeled data is impractical. A GPU with roughly 4GB of VRAM is typically sufficient for the ViT-B/32 model; larger models such as ViT-L/14 need more, approximately 8GB.
Execution Steps
Step 1: Environment setup
Install the CLIP package and its dependencies (PyTorch, torchvision, ftfy, regex, tqdm). Verify that a compatible PyTorch version (1.7.1 or later) is available and determine whether to use GPU or CPU execution.
Key considerations:
- CLIP requires PyTorch 1.7.1+ and torchvision
- A CUDA GPU accelerates inference, but CPU execution is also supported
- The package includes a bundled BPE vocabulary file (~1.3MB)
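The checks in this step can be sketched with the standard library alone. The helper names below (`missing_modules`, `parse_version`, `meets_minimum`) are hypothetical and only illustrate the idea; they are not part of the CLIP package.

```python
import importlib.util
import re

# Import names of the dependencies listed above; "clip" is the import
# name of the CLIP package itself.
REQUIRED_MODULES = ["torch", "torchvision", "ftfy", "regex", "tqdm", "clip"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def parse_version(version):
    """Turn a version string such as '1.13.1+cu117' into a tuple of ints."""
    core = version.split("+")[0]  # drop local build tags like '+cu117'
    parts = []
    for piece in core.split(".")[:3]:
        match = re.match(r"\d+", piece)  # keep only the leading digits
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def meets_minimum(version, minimum=(1, 7, 1)):
    """Check a version string against the minimum PyTorch version."""
    return parse_version(version) >= minimum
```

With PyTorch installed, `meets_minimum(torch.__version__)` covers the version check, and `"cuda" if torch.cuda.is_available() else "cpu"` is a common way to pick the execution device.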
Step 2: Model loading
Select a pretrained CLIP model variant and load it along with its associated image preprocessing transform. The loader downloads the model checkpoint (if not cached), determines whether it is a JIT archive or a state dict, and constructs the model architecture by inferring hyperparameters from the weight tensor shapes.
Key considerations:
- Nine model variants available: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
- Models are cached in ~/.cache/clip after first download
- SHA256 checksum verification ensures download integrity
- The returned preprocessing transform handles resize, center crop, RGB conversion, tensor conversion, and normalization
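To make "inferring hyperparameters from the weight tensor shapes" concrete, here is a minimal sketch for a ViT-style checkpoint. It works on a plain mapping of weight name to shape so that it runs without PyTorch. The function name is hypothetical, and the key names, the example shapes, and the 64-dimensions-per-head rule are assumptions about the released checkpoints rather than a copy of the loader's code.

```python
def infer_vit_hyperparameters(shapes):
    """Derive architecture settings from a mapping of weight name -> shape."""
    conv_shape = shapes["visual.conv1.weight"]
    vision_width = conv_shape[0]   # output channels of the patch embedding
    patch_size = conv_shape[-1]    # kernel size equals the patch size
    # The positional embedding has one row per patch plus one class token.
    num_positions = shapes["visual.positional_embedding"][0]
    grid_size = round((num_positions - 1) ** 0.5)
    transformer_width = shapes["ln_final.weight"][0]
    return {
        "embed_dim": shapes["text_projection"][1],
        "image_resolution": patch_size * grid_size,
        "vision_width": vision_width,
        "vision_patch_size": patch_size,
        "context_length": shapes["positional_embedding"][0],
        "vocab_size": shapes["token_embedding.weight"][0],
        "transformer_width": transformer_width,
        # Assumption: each text-transformer attention head is 64 wide.
        "transformer_heads": transformer_width // 64,
    }


# Illustrative shapes, assumed to match a ViT-B/32 checkpoint.
VIT_B32_SHAPES = {
    "visual.conv1.weight": (768, 3, 32, 32),
    "visual.positional_embedding": (50, 768),
    "text_projection": (512, 512),
    "positional_embedding": (77, 512),
    "token_embedding.weight": (49408, 512),
    "ln_final.weight": (512,),
}
```

In practice the whole step is a single call, `model, preprocess = clip.load("ViT-B/32", device=device)`, and `clip.available_models()` lists the variant names.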
Step 3: Image preprocessing
Apply the model-specific preprocessing transform to each input image. The transform converts a PIL Image into a normalized tensor at the resolution the model expects; the resulting tensors are then stacked into a batch for efficient encoding.
What happens:
- Resize so the shorter side matches the model input resolution (e.g., 224px for ViT-B/32) using bicubic interpolation
- Center crop to square dimensions
- Convert to RGB if necessary
- Normalize pixel values using CLIP training statistics (mean and std)
- Stack into a batch tensor and move to the target device
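The geometry and normalization above can be illustrated without any image library. This is a rough sketch, not the transform itself: the function names are hypothetical, the rounding choices are assumptions, and the mean and standard deviation constants are the per-channel CLIP statistics as I understand them.

```python
# Per-channel statistics (R, G, B) used to normalize CLIP inputs.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resize_shorter_side(width, height, size=224):
    """Scale so the shorter side equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, size=224):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = round((width - size) / 2)
    top = round((height - size) / 2)
    return left, top, left + size, top + size


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channel values are already in [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))
```

In real use the `preprocess` callable returned by the loader performs all of this: something like `torch.stack([preprocess(img) for img in images]).to(device)` builds the batch.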
Step 4: Text tokenization
Convert each candidate class label into a tokenized sequence using the BPE tokenizer. Each text description is cleaned, lowercased, split into subword tokens, and wrapped with start-of-text and end-of-text special tokens, then zero-padded to the fixed context length of 77 tokens.
Key considerations:
- Prefix each class name with a prompt phrase (e.g., "a photo of a {class}") to improve performance
- The tokenizer uses a ~49K vocabulary with byte-pair encoding
- Sequences exceeding 77 tokens raise an error by default; with truncation enabled (truncate=True), they are cut to 77 tokens and the last token is set to end-of-text
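The prompt templating and the fixed-length layout can be sketched as follows. The BPE encoding itself is left out: `wrap_and_pad` takes token ids that are assumed to be already encoded. Both helper names are hypothetical, and the code mimics the behavior described above rather than reproducing the tokenizer's implementation.

```python
SOT_TOKEN = 49406  # id of the start-of-text token in the CLIP vocabulary
EOT_TOKEN = 49407  # id of the end-of-text token
CONTEXT_LENGTH = 77


def build_prompts(class_names, template="a photo of a {}"):
    """Wrap each class name in a prompt phrase."""
    return [template.format(name) for name in class_names]


def wrap_and_pad(token_ids, context_length=CONTEXT_LENGTH, truncate=False):
    """Add the special tokens, then zero-pad (or truncate) to a fixed length."""
    tokens = [SOT_TOKEN] + list(token_ids) + [EOT_TOKEN]
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(
                f"input is too long for context length {context_length}"
            )
        tokens = tokens[:context_length]
        tokens[-1] = EOT_TOKEN  # keep end-of-text as the final token
    return tokens + [0] * (context_length - len(tokens))
```

With the package installed, `clip.tokenize(build_prompts(class_names))` does the encoding and padding in one call and returns a tensor of shape (number of classes, 77).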
Step 5: Feature encoding
Pass the preprocessed image batch through the vision encoder and the tokenized text batch through the text encoder. Both encoders produce embedding vectors in the same shared latent space, enabling cross-modal comparison.
What happens:
- Vision encoder (ViT or ModifiedResNet) produces one embedding vector per image
- Text encoder (causal Transformer) produces one embedding vector per text sequence, taken from the end-of-text token position
- Both embeddings are projected to the same dimensionality via learned projection matrices
- Inference runs with gradients disabled for efficiency
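A minimal sketch of the encoding call, assuming `model` is the object returned by `clip.load` and that both inputs are already on the model's device. The wrapper name `encode_features` is hypothetical; `encode_image`, `encode_text`, and `torch.no_grad` are the calls it relies on.

```python
def encode_features(model, image_batch, text_tokens):
    """Return (image_features, text_features) without tracking gradients.

    `image_batch` is the stacked output of the preprocessing transform and
    `text_tokens` is the tokenized label batch.
    """
    import torch  # imported lazily so the sketch loads without PyTorch

    with torch.no_grad():
        image_features = model.encode_image(image_batch)
        text_features = model.encode_text(text_tokens)
    return image_features, text_features
```

Each returned tensor has one row per input, and both share the same embedding dimensionality, which is what makes the cross-modal comparison in the next step possible.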
Step 6: Similarity computation and prediction
Normalize the image and text feature vectors to unit length, compute cosine similarity between every image-text pair, scale by the exponentiated learned temperature parameter (logit_scale), and apply softmax over the candidate labels to produce per-class probabilities. Select the top-k predictions.
What happens:
- L2-normalize both image and text feature tensors
- Compute the dot product of the normalized features (cosine similarity) and multiply by the logit scale, which is approximately 100 for the released checkpoints
- Apply softmax to get a probability distribution over classes
- Sort by probability to retrieve top predictions
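The arithmetic of this step, written out in plain Python for a single image so that each operation is visible. The function names are hypothetical, and the fixed `logit_scale=100.0` default is an assumption standing in for the model's learned value.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(image_feature, text_features, labels, logit_scale=100.0, k=5):
    """Return the top-k (label, probability) pairs for one image."""
    image_unit = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text_unit = l2_normalize(text_feature)
        # Dot product of unit vectors is the cosine similarity.
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

With tensors, the same computation is usually written as `(100.0 * image_features @ text_features.T).softmax(dim=-1)` after dividing each feature tensor by its norm along the last dimension, followed by `topk` to pick the best labels.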