
Implementation:OpenAI CLIP CLIP.encode_image

From Leeroopedia
Knowledge Sources
Domains Vision, Deep_Learning, Representation_Learning
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool, provided by the CLIP model class, for encoding preprocessed images into embedding vectors.

Description

The CLIP.encode_image() method takes a batch of preprocessed image tensors and passes them through the vision encoder (either a VisionTransformer or ModifiedResNet, depending on the loaded model variant). It casts the input to the model's dtype (fp16 on GPU, fp32 on CPU) and delegates to self.visual, which is the vision encoder submodule.

The output is a tensor of shape [B, embed_dim] containing one feature vector per image. These vectors are not L2-normalized by this method; normalization must be applied explicitly when computing cosine similarities.

Usage

Call this method after preprocessing images with the transform returned by clip.load(). Use within a torch.no_grad() context for inference to save memory.

Code Reference

Source Location

  • Repository: OpenAI CLIP
  • File: clip/model.py
  • Lines: L340-341

Signature

def encode_image(self, image: torch.Tensor) -> torch.Tensor:
    """Encode images through the vision encoder.

    Casts input to model dtype and passes through self.visual
    (VisionTransformer or ModifiedResNet).

    Parameters
    ----------
    image : torch.Tensor
        Batch of preprocessed images, shape [B, 3, n_px, n_px].

    Returns
    -------
    torch.Tensor
        Image feature vectors, shape [B, embed_dim]. Not L2-normalized.
    """
    return self.visual(image.type(self.dtype))

Import

import clip
model, preprocess = clip.load("ViT-B/32")
# Then call: model.encode_image(image_tensor)

I/O Contract

Inputs

Name Type Required Description
image torch.Tensor Yes Batch of preprocessed images, shape [B, 3, n_px, n_px], on the same device as the model

Outputs

Name Type Description
image_features torch.Tensor Image embedding vectors of shape [B, embed_dim]. embed_dim depends on model variant (e.g. 512 for ViT-B/32, 768 for ViT-L/14). Not L2-normalized.

Usage Examples

Basic Image Encoding

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Preprocess and encode a single image
image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
# image_features.shape: [1, 512]

Encoding with L2 Normalization

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    # Normalize for cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
# image_features.shape: [1, 512], unit norm

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
