Implementation: OpenAI CLIP CLIP.encode_image()
| Knowledge Sources | |
|---|---|
| Domains | Vision, Deep_Learning, Representation_Learning |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Method on the CLIP model class that encodes batches of preprocessed images into embedding vectors.
Description
The CLIP.encode_image() method takes a batch of preprocessed image tensors and passes them through the vision encoder (either a VisionTransformer or ModifiedResNet, depending on the loaded model variant). It casts the input to the model's dtype (fp16 on GPU, fp32 on CPU) and delegates to self.visual, which is the vision encoder submodule.
The output is a tensor of shape [B, embed_dim] containing one feature vector per image. These vectors are not L2-normalized by this method; normalization must be applied explicitly when computing cosine similarities.
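Because the returned features are unnormalized, cosine similarity requires an explicit L2-normalization step. A minimal sketch of that step, using random dummy tensors in place of real encode_image/encode_text outputs (embed_dim of 512 assumed, as for ViT-B/32):

```python
import torch

# Dummy features standing in for model.encode_image / model.encode_text
# outputs; values are random, embed_dim=512 assumed as for ViT-B/32
image_features = torch.randn(2, 512)  # two images
text_features = torch.randn(3, 512)   # three text prompts

# encode_image does not L2-normalize, so normalize explicitly
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between every image and every prompt: shape [2, 3]
similarity = image_features @ text_features.T
```

With unit-norm rows, every entry of `similarity` is a cosine similarity in [-1, 1].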
Usage
Call this method after preprocessing images with the transform returned by clip.load(). Use within a torch.no_grad() context for inference to save memory.
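The preprocessing transform yields one tensor per image, so encoding several images in one call means stacking them into a batch first. A sketch of the shape mechanics, with random tensors standing in for preprocessed images (input resolution n_px=224 assumed, as for ViT-B/32):

```python
import torch

# Random stand-ins for outputs of the preprocessing transform, which
# yields one [3, n_px, n_px] tensor per image (n_px=224 assumed here)
preprocessed = [torch.randn(3, 224, 224) for _ in range(4)]

# Stack into a [B, 3, n_px, n_px] batch suitable for model.encode_image
batch = torch.stack(preprocessed)  # shape [4, 3, 224, 224]
```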
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/model.py
- Lines: L340-341
Signature
def encode_image(self, image: torch.Tensor) -> torch.Tensor:
    """Encode images through the vision encoder.

    Casts input to model dtype and passes through self.visual
    (VisionTransformer or ModifiedResNet).

    Parameters
    ----------
    image : torch.Tensor
        Batch of preprocessed images, shape [B, 3, n_px, n_px].

    Returns
    -------
    torch.Tensor
        Image feature vectors, shape [B, embed_dim]. Not L2-normalized.
    """
    return self.visual(image.type(self.dtype))
Import
import clip
model, preprocess = clip.load("ViT-B/32")
# Then call: model.encode_image(image_tensor)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image | torch.Tensor | Yes | Batch of preprocessed images, shape [B, 3, n_px, n_px], on the same device as the model |
Outputs
| Name | Type | Description |
|---|---|---|
| image_features | torch.Tensor | Image embedding vectors of shape [B, embed_dim]. embed_dim depends on model variant (e.g. 512 for ViT-B/32, 768 for ViT-L/14). Not L2-normalized. |
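For downstream shape checks, it can help to record the expected embed_dim per variant. A small lookup sketch; the values are assumed to match the released OpenAI checkpoints and should be verified against the loaded model:

```python
# Expected embedding width per variant (assumption: matches the released
# OpenAI CLIP checkpoints; verify against your loaded model)
EMBED_DIMS = {
    "RN50": 1024,
    "ViT-B/32": 512,
    "ViT-B/16": 512,
    "ViT-L/14": 768,
}

def check_features(features, variant: str) -> None:
    # Raise if the feature tensor's last dimension is unexpected
    expected = EMBED_DIMS[variant]
    if features.shape[-1] != expected:
        raise ValueError(
            f"expected embed_dim {expected}, got {features.shape[-1]}"
        )
```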
Usage Examples
Basic Image Encoding
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess and encode a single image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
# image_features.shape: [1, 512]
Encoding with L2 Normalization
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)

# Normalize for cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
# image_features.shape: [1, 512], unit norm