
Principle:OpenAI CLIP Image Preprocessing

From Leeroopedia
Domains: Vision, Preprocessing
Last Updated: 2026-02-13 22:00 GMT

Overview

A deterministic image transformation pipeline that converts raw images of arbitrary size into fixed-resolution, normalized tensors suitable for a vision encoder.

Description

Image Preprocessing is the standardized transformation applied to input images before they are fed into a vision model. For contrastive vision-language models, the preprocessing pipeline must exactly match the training-time augmentation to ensure feature consistency. The CLIP preprocessing pipeline consists of:

  1. Resize: Scale the image so that the shorter side matches the model's input resolution, using bicubic interpolation for quality preservation.
  2. Center Crop: Extract a square center crop of exactly n_px x n_px pixels to ensure a fixed spatial dimension.
  3. RGB Conversion: Convert any image mode (grayscale, RGBA, palette) to 3-channel RGB.
  4. Tensor Conversion: Convert the PIL image to a float32 tensor with values in [0, 1] and shape [3, H, W].
  5. Normalization: Subtract the CLIP-specific channel means and divide by channel standard deviations, computed over the training dataset (WebImageText).
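The geometry of steps 1 and 2 can be sketched in pure Python. The helper names below are illustrative, not part of the CLIP API; they compute the scaled size for the shorter-side resize and the box for the square center crop.

```python
def resize_shorter_side(w, h, n_px):
    # Scale so the shorter side equals n_px, preserving aspect ratio (step 1).
    scale = n_px / min(w, h)
    return round(w * scale), round(h * scale)

def center_crop_box(w, h, n_px):
    # (left, upper, right, lower) box for an n_px x n_px center crop (step 2).
    left = (w - n_px) // 2
    top = (h - n_px) // 2
    return (left, top, left + n_px, top + n_px)

# A 640x480 image resized for a 224-px model, then center-cropped:
size = resize_shorter_side(640, 480, 224)   # -> (299, 224)
box = center_crop_box(size[0], size[1], 224)  # -> (37, 0, 261, 224)
```

Note that the actual pipeline also uses bicubic interpolation when resampling; the sketch above only captures the output dimensions.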

The normalization statistics are specific to CLIP's training data and differ from the commonly used ImageNet statistics.
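To make the difference concrete, the widely published ImageNet statistics can be compared against CLIP's constants; substituting one for the other shifts and rescales every channel. The mid-gray pixel below is just an illustrative input.

```python
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Normalizing the same mid-gray pixel (0.5, 0.5, 0.5) with each set of
# statistics yields different values, so embeddings would differ too.
clip_vals = [(0.5 - m) / s for m, s in zip(CLIP_MEAN, CLIP_STD)]
imagenet_vals = [(0.5 - m) / s for m, s in zip(IMAGENET_MEAN, IMAGENET_STD)]
```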

Usage

Use this principle whenever feeding images into a CLIP model. The preprocessing transform is returned alongside the model by the model loading step and must be applied to all input images. Skipping or modifying preprocessing will produce incorrect embeddings.

Theoretical Basis

The key theoretical motivation is distribution alignment: the model was trained on images preprocessed with a specific pipeline, so inference-time images must undergo the same transformation to lie in the same input distribution.

CLIP-specific normalization constants:

# These values were computed from CLIP's training data (WebImageText)
# NOT the standard ImageNet mean/std
mean = (0.48145466, 0.4578275, 0.40821073)
std  = (0.26862954, 0.26130258, 0.27577711)

# Pipeline: Resize -> CenterCrop -> RGB -> ToTensor -> Normalize
# output_shape = [3, n_px, n_px] where n_px = model.visual.input_resolution

The input resolution n_px varies by model variant:

  • 224 pixels: RN50, RN101, ViT-B/32, ViT-B/16, ViT-L/14
  • 288 pixels: RN50x4
  • 384 pixels: RN50x16
  • 448 pixels: RN50x64
  • 336 pixels: ViT-L/14@336px
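The table above can be captured as a plain lookup, useful when building a preprocessing pipeline without loading the model first. The dictionary name is illustrative; in practice, read `model.visual.input_resolution` from the loaded model.

```python
# Input resolution (n_px) per CLIP variant, from the table above.
INPUT_RESOLUTION = {
    "RN50": 224, "RN101": 224, "ViT-B/32": 224, "ViT-B/16": 224,
    "RN50x4": 288, "RN50x16": 384, "RN50x64": 448,
    "ViT-L/14@336px": 336,
}
```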

