
Implementation:OpenAI CLIP Transform

From Leeroopedia
Knowledge Sources
Domains Vision, Preprocessing
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool for building the CLIP image preprocessing pipeline provided by the OpenAI CLIP library.

Description

The internal _transform() function constructs a torchvision.transforms.Compose pipeline that converts a raw PIL image of any size into a normalized tensor of shape [3, n_px, n_px]. The pipeline resizes with bicubic interpolation and applies CLIP-specific normalization constants computed from the WebImageText training dataset.
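As a geometry check, the first two stages (resize the shorter side to n_px, then center-crop to n_px x n_px) can be sketched in plain Python. The helper below is illustrative only, not part of the CLIP or torchvision API, and uses rounding that may differ by one pixel from torchvision's internal arithmetic.

```python
def resize_then_crop_dims(width, height, n_px=224):
    """Mimic the geometry of Resize(n_px) followed by CenterCrop(n_px).

    Resize with a single int scales the shorter side to n_px while
    preserving aspect ratio; CenterCrop then keeps the central
    n_px x n_px window.
    """
    if width <= height:
        resized = (n_px, round(height * n_px / width))
    else:
        resized = (round(width * n_px / height), n_px)
    cropped = (min(resized[0], n_px), min(resized[1], n_px))
    return resized, cropped

# A 640x480 landscape photo is resized to ~299x224, then cropped to 224x224.
print(resize_then_crop_dims(640, 480))
```

Because the crop is central, the long edges of non-square images are discarded symmetrically; only the middle square survives.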

Users do not call _transform() directly. Instead, it is called internally by clip.load(), which returns the resulting Compose object as the second element of its return tuple. The preprocessing pipeline is matched to the specific model variant's input resolution.
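For reference, the released checkpoints use the following input resolutions, so for example clip.load("RN50x4") returns a preprocess pipeline targeting 288 px. The dictionary below is a summary for illustration, not an object exposed by the library.

```python
# Input resolution (n_px) per released CLIP checkpoint. _transform()
# is called with model.visual.input_resolution, so the preprocess
# pipeline always matches the loaded vision tower.
INPUT_RESOLUTION = {
    "RN50": 224, "RN101": 224,
    "RN50x4": 288, "RN50x16": 384, "RN50x64": 448,
    "ViT-B/32": 224, "ViT-B/16": 224,
    "ViT-L/14": 224, "ViT-L/14@336px": 336,
}
print(INPUT_RESOLUTION["RN50x4"])  # 288
```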

Usage

Use the preprocess object returned by clip.load() to transform PIL images before passing them to the model. Apply it to individual images, then batch the resulting tensors with torch.stack() or use it as a dataset transform.

Code Reference

Source Location

  • Repository: OpenAI CLIP
  • File: clip/clip.py
  • Lines: L75-86

Signature

def _transform(n_px: int) -> torchvision.transforms.Compose:
    """Build the CLIP image preprocessing pipeline.

    Parameters
    ----------
    n_px : int
        Target resolution in pixels. Derived from model.visual.input_resolution.

    Returns
    -------
    torchvision.transforms.Compose
        A composition of: Resize(n_px, bicubic) -> CenterCrop(n_px) ->
        _convert_image_to_rgb -> ToTensor() ->
        Normalize(mean, std)
    """
    return Compose([
        Resize(n_px, interpolation=BICUBIC),
        CenterCrop(n_px),
        _convert_image_to_rgb,
        ToTensor(),
        Normalize(
            (0.48145466, 0.4578275, 0.40821073),
            (0.26862954, 0.26130258, 0.27577711)
        ),
    ])
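To make the final stage concrete: ToTensor() scales 8-bit pixel values into [0, 1], and Normalize then subtracts the per-channel mean and divides by the per-channel std. The stdlib sketch below reproduces that arithmetic for a single pixel; normalize_pixel is a hypothetical helper for illustration, not a CLIP function.

```python
# CLIP normalization constants, copied from _transform() above.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(rgb):
    """Apply ToTensor scaling (/255) then per-channel normalization."""
    return tuple((v / 255 - m) / s for v, m, s in zip(rgb, CLIP_MEAN, CLIP_STD))

# A pure-white pixel maps to roughly (1.93, 2.07, 2.15), so the
# normalized tensor is not bounded to [-1, 1].
print([round(c, 3) for c in normalize_pixel((255, 255, 255))])
```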

Import

# _transform is internal; users access the result via clip.load()
import clip
model, preprocess = clip.load("ViT-B/32")
# preprocess is the Compose object built by _transform()

I/O Contract

Inputs

Name Type Required Description
image PIL.Image.Image Yes Raw input image of any size and mode (RGB, RGBA, grayscale, palette)
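Non-RGB modes are accepted because the pipeline's _convert_image_to_rgb step simply calls image.convert("RGB"). A minimal sketch with Pillow (assuming Pillow is installed; the sample images are synthetic):

```python
from PIL import Image

# Mimic clip's _convert_image_to_rgb: it just calls image.convert("RGB"),
# so RGBA, grayscale, and palette images all become 3-channel RGB.
rgba = Image.new("RGBA", (8, 8), (255, 0, 0, 128))
gray = Image.new("L", (8, 8), 127)

print(rgba.convert("RGB").mode, gray.convert("RGB").mode)  # RGB RGB
```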

Outputs

Name Type Description
tensor torch.Tensor Preprocessed image tensor of shape [3, n_px, n_px], dtype float32, with CLIP-specific normalization applied

Usage Examples

Preprocessing a Single Image

import clip
import torch
from PIL import Image

# Load model and get the preprocess transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Open and preprocess an image
image = Image.open("photo.jpg")
image_tensor = preprocess(image).unsqueeze(0).to(device)
# image_tensor shape: [1, 3, 224, 224]

Preprocessing a Batch of Images

import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cuda")

image_paths = ["cat.jpg", "dog.jpg", "bird.jpg"]
images = [preprocess(Image.open(p)) for p in image_paths]
image_batch = torch.stack(images).to("cuda")
# image_batch shape: [3, 3, 224, 224]

Using as a Dataset Transform

import clip
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader

model, preprocess = clip.load("ViT-B/32", device="cuda")

# Pass preprocess directly as the transform argument
dataset = CIFAR100(root="~/.cache", download=True, train=False, transform=preprocess)
loader = DataLoader(dataset, batch_size=32, num_workers=2)

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
