Implementation: OpenAI CLIP Transform
| Knowledge Sources | |
|---|---|
| Domains | Vision, Preprocessing |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Documents _transform(), the internal helper that builds the CLIP image preprocessing pipeline shipped with the OpenAI CLIP library.
Description
The internal function _transform() constructs a torchvision.transforms.Compose pipeline that converts a raw PIL image of any size into a normalized tensor of shape [3, n_px, n_px]. The pipeline resizes with bicubic interpolation and applies CLIP-specific normalization constants computed from the WebImageText training dataset.
Users do not call _transform() directly. Instead, it is called internally by clip.load(), which returns the resulting Compose object as the second element of its return tuple. The preprocessing pipeline is matched to the specific model variant's input resolution.
Usage
Use the preprocess object returned by clip.load() to transform PIL images before passing them to the model. Apply it to individual images and batch the resulting tensors with torch.stack(), or pass it as a dataset transform.
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/clip.py
- Lines: L75-86
Signature
def _transform(n_px: int) -> torchvision.transforms.Compose:
"""Build the CLIP image preprocessing pipeline.
Parameters
----------
n_px : int
Target resolution in pixels. Derived from model.visual.input_resolution.
Returns
-------
torchvision.transforms.Compose
A composition of: Resize(n_px, bicubic) -> CenterCrop(n_px) ->
_convert_image_to_rgb -> ToTensor() ->
Normalize(mean, std)
"""
return Compose([
Resize(n_px, interpolation=BICUBIC),
CenterCrop(n_px),
_convert_image_to_rgb,
ToTensor(),
Normalize(
(0.48145466, 0.4578275, 0.40821073),
(0.26862954, 0.26130258, 0.27577711)
),
])
Import
# _transform is internal; users access the result via clip.load()
import clip
model, preprocess = clip.load("ViT-B/32")
# preprocess is the Compose object built by _transform()
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| image | PIL.Image.Image | Yes | Raw input image of any size and mode (RGB, RGBA, grayscale, palette) |
Outputs
| Name | Type | Description |
|---|---|---|
| tensor | torch.Tensor | Preprocessed image tensor of shape [3, n_px, n_px], dtype float32, with CLIP-specific normalization applied |
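Per channel, the contract amounts to mapping each 8-bit pixel through (x / 255 - mean) / std. A minimal pure-Python check of that arithmetic, using the constants from the source listing above (channel 0 = red):

```python
# Per-channel normalization applied by the pipeline: scale the 8-bit
# pixel to [0, 1] (ToTensor), then subtract the CLIP mean and divide
# by the CLIP std (Normalize). Constants copied from the source above.
MEAN = (0.48145466, 0.4578275, 0.40821073)
STD = (0.26862954, 0.26130258, 0.27577711)

def normalize_pixel(value, channel):
    """Map an 8-bit pixel value to its normalized float for one channel."""
    return (value / 255.0 - MEAN[channel]) / STD[channel]

# A pure-white red channel lands near +1.93; pure black near -1.79.
print(round(normalize_pixel(255, 0), 4))  # 1.9303
print(round(normalize_pixel(0, 0), 4))    # -1.7923
```

This is why the output tensor's values are roughly in the [-1.8, 2.2] range rather than [0, 1].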
Usage Examples
Preprocessing a Single Image
import clip
import torch
from PIL import Image
# Load model and get the preprocess transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Open and preprocess an image
image = Image.open("photo.jpg")
image_tensor = preprocess(image).unsqueeze(0).to(device)
# image_tensor shape: [1, 3, 224, 224]
Preprocessing a Batch of Images
import clip
import torch
from PIL import Image
model, preprocess = clip.load("ViT-B/32", device="cuda")
image_paths = ["cat.jpg", "dog.jpg", "bird.jpg"]
images = [preprocess(Image.open(p)) for p in image_paths]
image_batch = torch.stack(images).to("cuda")
# image_batch shape: [3, 3, 224, 224]
Using as a Dataset Transform
import clip
import os
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader
model, preprocess = clip.load("ViT-B/32", device="cuda")
# Pass preprocess directly as the transform argument;
# expand "~" explicitly, since torchvision does not.
dataset = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False, transform=preprocess)
loader = DataLoader(dataset, batch_size=32, num_workers=2)