Implementation: OpenAI CLIP CLIP.encode_image()
| Knowledge Sources | |
|---|---|
| Domains | Vision, Deep_Learning, Representation_Learning |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Method on the CLIP model class that encodes batches of preprocessed images into embedding vectors.
Description
The CLIP.encode_image() method takes a batch of preprocessed image tensors and passes them through the vision encoder (either a VisionTransformer or ModifiedResNet, depending on the loaded model variant). It casts the input to the model's dtype (fp16 on GPU, fp32 on CPU) and delegates to self.visual, which is the vision encoder submodule.
The output is a tensor of shape [B, embed_dim] containing one feature vector per image. These vectors are not L2-normalized by this method; normalization must be applied explicitly when computing cosine similarities.
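Because the returned features are unnormalized, cosine similarity requires an explicit L2-normalization step. A minimal sketch of that step, using random dummy tensors in place of real encode_image/encode_text outputs (embed_dim of 512 assumed, as for ViT-B/32):

```python
import torch

# Dummy features standing in for model.encode_image / model.encode_text
# outputs; values are random, embed_dim=512 assumed as for ViT-B/32
image_features = torch.randn(2, 512)  # two images
text_features = torch.randn(3, 512)   # three text prompts

# encode_image does not L2-normalize, so normalize explicitly
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Cosine similarity between every image and every prompt: shape [2, 3]
similarity = image_features @ text_features.T
```

With unit-norm rows, every entry of `similarity` is a cosine similarity in [-1, 1].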
Usage
Call this method after preprocessing images with the transform returned by clip.load(). Use within a torch.no_grad() context for inference to save memory.
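The preprocessing transform yields one tensor per image, so encoding several images in one call means stacking them into a batch first. A sketch of the shape mechanics, with random tensors standing in for preprocessed images (input resolution n_px=224 assumed, as for ViT-B/32):

```python
import torch

# Random stand-ins for outputs of the preprocessing transform, which
# yields one [3, n_px, n_px] tensor per image (n_px=224 assumed here)
preprocessed = [torch.randn(3, 224, 224) for _ in range(4)]

# Stack into a [B, 3, n_px, n_px] batch suitable for model.encode_image
batch = torch.stack(preprocessed)  # shape [4, 3, 224, 224]
```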
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/model.py
- Lines: L340-341
Signature
def encode_image(self, image: torch.Tensor) -> torch.Tensor:
    """Encode images through the vision encoder.

    Casts input to model dtype and passes through self.visual
    (VisionTransformer or ModifiedResNet).

    Parameters
    ----------
    image : torch.Tensor
        Batch of preprocessed images, shape [B, 3, n_px, n_px].

    Returns
    -------
    torch.Tensor
        Image feature vectors, shape [B, embed_dim]. Not L2-normalized.
    """
    return self.visual(image.type(self.dtype))
Import
import clip
model, preprocess = clip.load("ViT-B/32")
# Then call: model.encode_image(image_tensor)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image | torch.Tensor | Yes | Batch of preprocessed images, shape [B, 3, n_px, n_px], on the same device as the model |
Outputs
| Name | Type | Description |
|---|---|---|
| image_features | torch.Tensor | Image embedding vectors of shape [B, embed_dim]. embed_dim depends on model variant (e.g. 512 for ViT-B/32, 768 for ViT-L/14). Not L2-normalized. |
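For downstream shape checks, it can help to record the expected embed_dim per variant. A small lookup sketch; the values are assumed to match the released OpenAI checkpoints and should be verified against the loaded model:

```python
# Expected embedding width per variant (assumption: matches the released
# OpenAI CLIP checkpoints; verify against your loaded model)
EMBED_DIMS = {
    "RN50": 1024,
    "ViT-B/32": 512,
    "ViT-B/16": 512,
    "ViT-L/14": 768,
}

def check_features(features, variant: str) -> None:
    # Raise if the feature tensor's last dimension is unexpected
    expected = EMBED_DIMS[variant]
    if features.shape[-1] != expected:
        raise ValueError(
            f"expected embed_dim {expected}, got {features.shape[-1]}"
        )
```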
Usage Examples
Basic Image Encoding
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess and encode a single image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
# image_features.shape: [1, 512]
Encoding with L2 Normalization
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)

# Normalize for cosine similarity
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
# image_features.shape: [1, 512], unit norm