Workflow:Openai CLIP Zero shot image classification

Knowledge Sources	OpenAI CLIP; Learning Transferable Visual Models; CLIP Blog Post
Domains	Computer_Vision, Zero_Shot_Learning, Multimodal
Last Updated	2026-02-13 22:00 GMT

Overview

End-to-end process for classifying images into arbitrary text-described categories using CLIP without any task-specific training data.

Description

This workflow demonstrates the core capability of CLIP: zero-shot image classification. Given an image and a set of candidate text labels, the model encodes both modalities into a shared embedding space and selects the label whose text embedding is most similar to the image embedding. No labeled training examples are needed for the target classes, as CLIP leverages its contrastive pre-training on 400 million image-text pairs to generalize to new categories at inference time.

Goal: Produce ranked predictions for an image against arbitrary text-described categories.

Scope: From a raw image and a list of class names to ranked label probabilities.

Strategy: Uses the dual-encoder architecture to independently embed images and text, then computes cosine similarity scaled by a learned temperature parameter to produce classification logits.

Usage

Execute this workflow when you have one or more images and a set of candidate class descriptions, and you need to classify the images without any labeled training data for those classes. This is appropriate for rapid prototyping, novel category recognition, or any scenario where collecting labeled data is impractical. A GPU with at least 4GB VRAM is sufficient for the ViT-B/32 model; larger models (ViT-L/14) require approximately 8GB.

Execution Steps

Step 1: Environment setup

Install the CLIP package and its dependencies (PyTorch, torchvision, ftfy, regex, tqdm). Verify that a compatible PyTorch version (1.7.1 or later) is available and determine whether to use GPU or CPU execution.

Key considerations:

CLIP requires PyTorch 1.7.1+ and torchvision
A CUDA GPU accelerates inference, but CPU execution is also supported
The package includes a bundled BPE vocabulary file (~1.3MB)
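
The checks above can be sketched with the standard library alone. The helper names (missing_packages, meets_minimum, pick_device) are illustrative and not part of CLIP; the install command in the comment is the one given in the CLIP repository README.

```python
import importlib.util
import re

# Install (from the CLIP README): pip install git+https://github.com/openai/CLIP.git
REQUIRED = ["torch", "torchvision", "ftfy", "regex", "tqdm"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def version_tuple(version):
    """Parse the leading 'major.minor[.patch]' digits, ignoring suffixes like '+cu117'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(group or 0) for group in match.groups())


def meets_minimum(version, minimum="1.7.1"):
    """True if `version` is at least `minimum` (PyTorch 1.7.1 for CLIP)."""
    return version_tuple(version) >= version_tuple(minimum)


def pick_device():
    """Prefer CUDA when available, otherwise fall back to CPU."""
    import torch  # imported lazily so the helpers above run without PyTorch

    return "cuda" if torch.cuda.is_available() else "cpu"
```

A typical call sequence would be missing_packages() to confirm nothing is absent, then meets_minimum(torch.__version__), then pick_device().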

Step 2: Model loading

Select a pretrained CLIP model variant and load it along with its associated image preprocessing transform. The loader downloads the model checkpoint (if not cached), determines whether it is a JIT archive or a state dict, and constructs the model architecture by inferring hyperparameters from the weight tensor shapes.

Key considerations:

Nine model variants available: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
Models are cached in ~/.cache/clip after first download
SHA256 checksum verification ensures download integrity
The returned preprocessing transform handles resize, center crop, RGB conversion, tensor conversion, and normalization
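
A minimal sketch of this step follows. It assumes the clip package is installed and uses clip.available_models() and clip.load(), which the package exposes. The checksum helpers only illustrate the idea of SHA256 verification; they are hypothetical names and not the loader's internal code.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA256 so a large checkpoint need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_sha256):
    """Compare a file's digest with the expected hex string (illustrative helper)."""
    return sha256_of_file(path) == expected_sha256.lower()


def load_clip(name="ViT-B/32", device="cpu"):
    """Load a pretrained variant and return (model, preprocess)."""
    import clip  # imported lazily so the helpers above work without CLIP installed

    if name not in clip.available_models():
        raise ValueError(f"unknown CLIP variant: {name!r}")
    # clip.load downloads the checkpoint on first use and caches it afterwards.
    return clip.load(name, device=device)
```

Keeping the returned preprocess transform paired with its model matters because input resolution differs between variants (for example ViT-L/14@336px).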

Step 3: Image preprocessing

Apply the model-specific preprocessing transform to each input image. This converts a PIL Image into a normalized tensor at the resolution the model expects, then batches the tensors for efficient encoding.

What happens:

Resize so the shorter side matches the model input resolution (e.g., 224px for ViT-B/32) using bicubic interpolation
Center crop to square dimensions
Convert to RGB if necessary
Convert to a tensor and normalize pixel values using CLIP training statistics (per-channel mean and std)
Stack into a batch tensor and move to the target device
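
The geometry and normalization above can be illustrated in plain Python. This is a sketch under assumptions: the helpers approximate what the resize and center-crop steps do for an integer target size, the mean and std constants are the ones used in the CLIP repository's transform, and batch_images is a hypothetical wrapper around the preprocess transform returned by the loader.

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def resized_size(width, height, size=224):
    """Scale so the shorter side equals `size`, keeping the aspect ratio."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, size=224):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = int(round((width - size) / 2.0))
    top = int(round((height - size) / 2.0))
    return left, top, left + size, top + size


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def batch_images(paths, preprocess, device="cpu"):
    """Apply the model's transform to each file and stack into one batch tensor."""
    import torch  # lazy imports: only needed when real images are processed
    from PIL import Image

    tensors = [preprocess(Image.open(path)) for path in paths]
    return torch.stack(tensors).to(device)
```

For a 640x480 input and a 224px model, the resize yields 298x224 and the crop then removes 37 pixels from each side horizontally.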

Step 4: Text tokenization

Convert each candidate class label into a tokenized sequence using the BPE tokenizer. Each text description is cleaned, lowercased, split into subword tokens, and wrapped with start-of-text and end-of-text special tokens, then zero-padded to the fixed context length of 77 tokens.

Key considerations:

Prefix each class name with a prompt phrase (e.g., "a photo of a {class}") to improve performance
The tokenizer uses a ~49K vocabulary with byte-pair encoding
Sequences exceeding 77 tokens raise an error by default; passing truncate=True to the tokenizer truncates them instead
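
The prompt construction and the wrap-and-pad behavior can be sketched as follows. wrap_and_pad is a hypothetical helper that imitates the padding and truncation rules on a list of already-encoded ids; it does not perform BPE encoding. The real call is clip.tokenize, shown in tokenize_prompts. The special-token ids below are assumed to be the last two entries of CLIP's vocabulary.

```python
SOT_TOKEN = 49406  # <|startoftext|> (assumed id)
EOT_TOKEN = 49407  # <|endoftext|> (assumed id)
CONTEXT_LENGTH = 77


def build_prompts(class_names, template="a photo of a {}"):
    """Wrap each class name in a prompt phrase."""
    return [template.format(name) for name in class_names]


def wrap_and_pad(token_ids, context_length=CONTEXT_LENGTH, truncate=False):
    """Add start/end tokens, then zero-pad (or truncate) to `context_length`."""
    tokens = [SOT_TOKEN] + list(token_ids) + [EOT_TOKEN]
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(f"input is too long for context length {context_length}")
        tokens = tokens[:context_length]
        tokens[-1] = EOT_TOKEN  # keep the end-of-text marker after truncation
    return tokens + [0] * (context_length - len(tokens))


def tokenize_prompts(class_names):
    """Tokenize prompts with the real CLIP tokenizer (requires the clip package)."""
    import clip

    return clip.tokenize(build_prompts(class_names))
```

The template is a starting point; varying it per dataset (for example adding ", a type of pet") is a common way to tune zero-shot accuracy.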

Step 5: Feature encoding

Pass the preprocessed image batch through the vision encoder and the tokenized text batch through the text encoder. Both encoders produce embedding vectors in the same shared latent space, enabling cross-modal comparison.

What happens:

Vision encoder (ViT or ModifiedResNet) produces one embedding vector per image
Text encoder (causal Transformer) produces one embedding vector per text sequence, taken from the end-of-text token position
Both embeddings are projected to the same dimensionality via learned projection matrices
Inference runs with gradients disabled for efficiency
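
A short sketch of the encoding call, assuming a model returned by the loader and batches prepared as in the earlier steps. encode and eot_position are illustrative names; encode_image and encode_text are the model's methods. The eot_position helper shows why the end-of-text position can be found with an argmax: that token has the largest id in the sequence, and padding is zero.

```python
def eot_position(token_ids):
    """Index of the end-of-text token, i.e. the position holding the largest id."""
    return max(range(len(token_ids)), key=token_ids.__getitem__)


def encode(model, image_batch, text_tokens):
    """Run both encoders with gradients disabled and return their embeddings."""
    import torch  # imported lazily so eot_position works without PyTorch

    with torch.no_grad():
        image_features = model.encode_image(image_batch)
        text_features = model.encode_text(text_tokens)
    return image_features, text_features
```

Text features for a fixed label set can be computed once and reused across many image batches, since the two encoders run independently.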

Step 6: Similarity computation and prediction

Normalize the image and text feature vectors to unit length, compute cosine similarity between every image-text pair, scale by the learned temperature (logit_scale), and apply softmax to produce per-class probabilities. Select the top-k predictions.

What happens:

L2-normalize both image and text feature tensors
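Compute dot product (cosine similarity) multiplied by the logit scale (the exponentiated learned temperature, model.logit_scale.exp(); the CLIP README example uses the constant 100.0)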
Apply softmax to get probability distribution over classes
Sort by probability to retrieve top predictions
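
The four operations above can be written out in plain Python to make the arithmetic explicit. This is a sketch for a single image, with classify as a hypothetical helper and 100.0 as an assumed default scale; with PyTorch tensors the same computation is usually expressed as (100.0 * image_features @ text_features.T).softmax(dim=-1) after normalizing both feature tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_feature, text_features, labels, logit_scale=100.0, top_k=3):
    """Rank labels for one image by scaled cosine similarity."""
    image_unit = l2_normalize(image_feature)
    logits = []
    for text_feature in text_features:
        text_unit = l2_normalize(text_feature)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Because the scale is large, small differences in cosine similarity become sharp differences in probability, so the top prediction often carries almost all of the mass.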
