Workflow:Openai CLIP Zero shot image classification

From Leeroopedia
Knowledge Sources
Domains Computer_Vision, Zero_Shot_Learning, Multimodal
Last Updated 2026-02-13 22:00 GMT

Overview

End-to-end process for classifying images into arbitrary text-described categories using CLIP without any task-specific training data.

Description

This workflow demonstrates the core capability of CLIP: zero-shot image classification. Given an image and a set of candidate text labels, the model encodes both modalities into a shared embedding space and selects the label whose text embedding is most similar to the image embedding. No labeled training examples are needed for the target classes, as CLIP leverages its contrastive pre-training on 400 million image-text pairs to generalize to new categories at inference time.

Goal: Produce ranked predictions for an image against arbitrary text-described categories.

Scope: From a raw image and a list of class names to ranked label probabilities.

Strategy: Uses the dual-encoder architecture to independently embed images and text, then computes cosine similarity scaled by a learned temperature parameter to produce classification logits.
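The strategy above can be written as a single expression. The notation here is an assumption introduced for illustration and does not come from the source: f_I and f_T denote the image and text encoders, x the image, t_k the prompt for class k, and s the scale factor obtained from the learned temperature.

```latex
p(y = k \mid x) =
  \frac{\exp\!\big(s \cdot \cos\big(f_I(x),\, f_T(t_k)\big)\big)}
       {\sum_{j=1}^{K} \exp\!\big(s \cdot \cos\big(f_I(x),\, f_T(t_j)\big)\big)},
\qquad
\cos(u, v) = \frac{u^{\top} v}{\lVert u \rVert_2 \, \lVert v \rVert_2}
```

The predicted label is the k with the largest probability, which is the same as the k with the largest cosine similarity because the softmax is monotonic.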

Usage

Use this workflow when you have one or more images and a set of candidate class descriptions, and you need to classify the images without any labeled training data for those classes. This is appropriate for rapid prototyping, novel category recognition, or any scenario where collecting labeled data is impractical. A GPU with at least 4 GB of VRAM is sufficient for the ViT-B/32 model; larger models (ViT-L/14) require approximately 8 GB.

Execution Steps

Step 1: Environment setup

Install the CLIP package and its dependencies (PyTorch, torchvision, ftfy, regex, tqdm). Verify that a compatible PyTorch version (1.7.1 or later) is available and determine whether to use GPU or CPU execution.

Key considerations:

  • CLIP requires PyTorch 1.7.1 or later and a matching torchvision release
  • A CUDA GPU accelerates inference, but CPU execution is also supported
  • The package includes a bundled BPE vocabulary file (~1.3MB)
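The version and device checks in this step could be sketched as follows. The helper names (parse_version, torch_is_compatible, pick_device) are hypothetical and introduced only for illustration. The install command in the comment is the one given in the CLIP repository README.

```python
# Install (from the CLIP README):
#   pip install ftfy regex tqdm
#   pip install git+https://github.com/openai/CLIP.git
import re

MIN_TORCH = (1, 7, 1)  # minimum PyTorch version stated in this step


def parse_version(version: str) -> tuple:
    """Turn a string such as '1.13.0+cu117' into (1, 13, 0)."""
    # Drop the local build suffix after '+', then keep the first three numbers.
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def torch_is_compatible(version: str, minimum: tuple = MIN_TORCH) -> bool:
    """True when the given PyTorch version string meets the minimum."""
    return parse_version(version) >= minimum


def pick_device() -> str:
    """Return 'cuda' when a CUDA GPU is usable, otherwise 'cpu'."""
    import torch  # imported lazily so the helpers above work without PyTorch

    if not torch_is_compatible(torch.__version__):
        raise RuntimeError(f"PyTorch {torch.__version__} is older than 1.7.1")
    return "cuda" if torch.cuda.is_available() else "cpu"
```

Calling pick_device() once at startup and passing the result to the later steps keeps the GPU/CPU decision in one place.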

Step 2: Model loading

Select a pretrained CLIP model variant and load it along with its associated image preprocessing transform. The loader downloads the model checkpoint (if not cached), determines whether it is a JIT archive or a state dict, and constructs the model architecture by inferring hyperparameters from the weight tensor shapes.

Key considerations:

  • Nine model variants available: RN50, RN101, RN50x4, RN50x16, RN50x64, ViT-B/32, ViT-B/16, ViT-L/14, ViT-L/14@336px
  • Models are cached in ~/.cache/clip after first download
  • SHA256 checksum verification ensures download integrity
  • The returned preprocessing transform handles resize, center crop, RGB conversion, tensor conversion, and normalization
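A minimal loading sketch, assuming the openai/CLIP package is installed. clip.load and clip.available_models are the package's public functions; KNOWN_MODELS, check_model_name, and load_clip are hypothetical names used here for illustration, and the list simply mirrors the variants named above.

```python
# Variant names as listed in this workflow. At runtime,
# clip.available_models() is the authoritative list.
KNOWN_MODELS = [
    "RN50", "RN101", "RN50x4", "RN50x16", "RN50x64",
    "ViT-B/32", "ViT-B/16", "ViT-L/14", "ViT-L/14@336px",
]


def check_model_name(name: str) -> str:
    """Fail early with a readable message instead of inside the loader."""
    if name not in KNOWN_MODELS:
        raise ValueError(f"Unknown CLIP model {name!r}; choose from {KNOWN_MODELS}")
    return name


def load_clip(name: str = "ViT-B/32", device: str = "cpu"):
    """Return (model, preprocess) for the requested variant."""
    import clip  # lazy import: needs the CLIP package and PyTorch

    # Downloads the checkpoint on first use, then reads it from the cache.
    model, preprocess = clip.load(check_model_name(name), device=device)
    model.eval()  # harmless if the loader already set eval mode
    return model, preprocess
```

The returned preprocess callable should always be used with the model it was loaded with, since the input resolution differs between variants.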

Step 3: Image preprocessing

Apply the model-specific preprocessing transform to each input image. This converts a PIL Image into a normalized tensor at the resolution the model expects, then batches the tensors for efficient encoding.

What happens:

  • Resize so the shorter side matches the model input resolution (e.g., 224px for ViT-B/32) using bicubic interpolation
  • Center crop to a square at that resolution
  • Convert to RGB if necessary
  • Convert to a tensor and normalize pixel values using CLIP training statistics (per-channel mean and std)
  • Stack into a batch tensor and move to the target device
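The normalization and batching described above might look like the sketch below. The mean and std constants are the values used by CLIP's preprocessing transform. normalize_pixel and preprocess_batch are hypothetical helpers, and preprocess is assumed to be the transform returned by the loader in Step 2.

```python
# Per-channel statistics applied by CLIP's normalization step.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))


def preprocess_batch(image_paths, preprocess, device="cpu"):
    """Apply the model's transform to each file and stack into one tensor."""
    import torch  # lazy imports: only needed when images are processed
    from PIL import Image

    # Each transformed image has shape (3, H, W); stacking gives (N, 3, H, W).
    tensors = [preprocess(Image.open(path)) for path in image_paths]
    return torch.stack(tensors).to(device)
```

normalize_pixel only illustrates the arithmetic for a single pixel; in practice the transform applies the same operation to the whole tensor.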

Step 4: Text tokenization

Convert each candidate class label into a tokenized sequence using the BPE tokenizer. Each text description is cleaned, lowercased, split into subword tokens, and wrapped with start-of-text and end-of-text special tokens, then zero-padded to the fixed context length of 77 tokens.

Key considerations:

  • Prefix each class name with a prompt phrase (e.g., "a photo of a {class}") to improve performance
  • The tokenizer uses a ~49K vocabulary with byte-pair encoding
  • Sequences exceeding 77 tokens raise an error by default; with truncation enabled they are cut to 77 tokens and the final token is set to end-of-text
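Prompt construction and tokenization could be sketched like this. clip.tokenize with its context_length and truncate parameters is part of the CLIP package; build_prompts and tokenize_prompts are hypothetical wrappers, and the default template is the example phrase from the list above.

```python
def build_prompts(class_names, template="a photo of a {}"):
    """Wrap each class name in a prompt phrase before tokenization."""
    return [template.format(name) for name in class_names]


def tokenize_prompts(prompts, truncate=False):
    """Return an integer tensor of shape (len(prompts), 77)."""
    import clip  # lazy import: needs the CLIP package and PyTorch

    # With truncate=False an over-long prompt raises an error;
    # with truncate=True it is cut to the context length instead.
    return clip.tokenize(prompts, context_length=77, truncate=truncate)
```

For example, build_prompts(["dog", "cat"]) yields the two strings "a photo of a dog" and "a photo of a cat", which are then passed to tokenize_prompts.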

Step 5: Feature encoding

Pass the preprocessed image batch through the vision encoder and the tokenized text batch through the text encoder. Both encoders produce embedding vectors in the same shared latent space, enabling cross-modal comparison.

What happens:

  • Vision encoder (ViT or ModifiedResNet) produces one embedding vector per image
  • Text encoder (causal Transformer) produces one embedding vector per text sequence, taken from the end-of-text token position
  • Both embeddings are projected to the same dimensionality via learned projection matrices
  • Inference runs with gradients disabled for efficiency
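The encoding step can be sketched as below, assuming model comes from the loader in Step 2 and the inputs from Steps 3 and 4. encode_image and encode_text are methods of the CLIP model. eot_position and encode_features are hypothetical names. The first helper illustrates how the end-of-text position can be found: the end-of-text token has the highest id in the vocabulary, so it is the position of the maximum id in a padded sequence.

```python
SOT_ID, EOT_ID = 49406, 49407  # start-of-text and end-of-text token ids


def eot_position(token_ids):
    """Index of the end-of-text token, found as the position of the largest id."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])


def encode_features(model, image_batch, text_tokens):
    """Run both encoders without tracking gradients."""
    import torch  # lazy import: only needed when a model is actually run

    with torch.no_grad():
        image_features = model.encode_image(image_batch)  # (N_images, D)
        text_features = model.encode_text(text_tokens)    # (N_texts, D)
    return image_features, text_features
```

The middle ids in any hand-written test sequence are arbitrary placeholders; only the start, end, and padding positions matter for eot_position.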

Step 6: Similarity computation and prediction

Normalize the image and text feature vectors to unit length, compute cosine similarity between every image-text pair, scale by the learned temperature (logit_scale), and apply softmax to produce per-class probabilities. Select the top-k predictions.

What happens:

  • L2-normalize both image and text feature tensors
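  • Compute the dot product of the normalized features (cosine similarity) and multiply by the logit scale, the exponential of the learned logit_scale parameter, which is approximately 100 in the released checkpoints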
  • Apply softmax to get probability distribution over classes
  • Sort by probability to retrieve top predictions
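The arithmetic of this step is shown below in plain Python for a single image so that each operation is visible. This is an illustrative sketch, not the library's implementation: the function names are hypothetical, and the default scale of 100.0 is an assumption standing in for the model's own logit scale. With PyTorch tensors the same computation is usually written as (100.0 * image_features @ text_features.T).softmax(dim=-1) after normalizing both tensors.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(image_feature, text_features, labels, logit_scale=100.0, top_k=3):
    """Return up to top_k (label, probability) pairs, most probable first."""
    img = l2_normalize(image_feature)
    logits = []
    for text in text_features:
        txt = l2_normalize(text)
        cosine = sum(a * b for a, b in zip(img, txt))  # dot product of unit vectors
        logits.append(logit_scale * cosine)
    probs = softmax(logits)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

Because the scale factor is large, small differences in cosine similarity turn into sharply peaked probabilities, so the top prediction often carries nearly all of the probability mass.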
