Implementation:Mlfoundations Open flamingo Image processor pipeline
Overview
User-defined pattern for preprocessing raw images through the CLIP image processor and assembling them into the 6D tensor format required by OpenFlamingo.
Description
This is a Pattern Doc. The image_processor callable returned by create_model_and_transforms applies CLIP-compatible transforms (resize, center crop, normalize). Users must then stack individual image tensors using torch.cat and reshape to (B, T_img, F, C, H, W) where F=1. The Flamingo._encode_vision_x method expects this exact shape.
The pipeline is not a single function call but rather a multi-step user-side pattern that bridges the gap between raw PIL images and the model's expected input contract. The image_processor handles per-image normalization, while the batching and reshaping into the 6D format is the caller's responsibility.
Usage
After obtaining image_processor from create_model_and_transforms, apply it to each PIL image individually to get a (C, H, W) tensor, then assemble the full batch in the 6D format before passing as vision_x to the model.
Code Reference
Source: Repository https://github.com/mlfoundations/open_flamingo
- File: open_flamingo/src/factory.py, lines L42–46 — image_processor is obtained from open_clip during model construction
- File: open_flamingo/src/flamingo.py, lines L177–200 — _encode_vision_x expects shape (B, T_img, F, C, H, W)
Interface pattern:
# Step 1: Process each PIL image individually
processed = [image_processor(img) for img in pil_images] # each: torch.Tensor of shape (C, H, W)
# Step 2: Stack into a sequence along the T_img dimension
vision_x = torch.cat([t.unsqueeze(0) for t in processed], dim=0) # (T_img, C, H, W)
# Step 3: Reshape to the required 6D format
vision_x = vision_x.unsqueeze(0).unsqueeze(2) # (1, T_img, 1, C, H, W) for a single example
# For batched input: (B, T_img, F, C, H, W) where F=1
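For batches with more than one example, each example's image sequence must share a common T_img. The following is a minimal sketch of one way to assemble such a batch; `batch_vision_x` is a hypothetical helper, not part of OpenFlamingo, and the zero-padding of shorter sequences is an assumption (OpenFlamingo does not prescribe a padding scheme here).

```python
import torch

def batch_vision_x(per_example_images, pad_value=0.0):
    """Assemble a list of per-example image lists into (B, T_img, 1, C, H, W).

    per_example_images: list of lists of (C, H, W) tensors, one inner list per
    example. Examples with fewer images are padded with pad_value to the
    longest sequence in the batch (an assumed convention, see note above).
    """
    max_t = max(len(imgs) for imgs in per_example_images)
    c, h, w = per_example_images[0][0].shape
    batch = []
    for imgs in per_example_images:
        seq = torch.stack(imgs, dim=0)  # (t, C, H, W)
        if seq.shape[0] < max_t:
            # Pad the sequence dimension up to the batch-wide maximum
            pad = torch.full((max_t - seq.shape[0], c, h, w), pad_value)
            seq = torch.cat([seq, pad], dim=0)
        batch.append(seq)
    # Stack examples into B, then insert the F=1 frame dimension
    return torch.stack(batch, dim=0).unsqueeze(2)  # (B, T_img, 1, C, H, W)
```

Note that `torch.stack` over already-processed (C, H, W) tensors is equivalent to the `torch.cat`-of-unsqueezed-tensors idiom shown above.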
Import: The image_processor is returned by create_model_and_transforms(), not imported separately. There is no standalone module to import; it is a closure produced at model initialization time.
I/O Contract
| Name | Type | Required | Description |
|---|---|---|---|
| pil_images | List[PIL.Image] | Yes | Raw images to preprocess |
| image_processor | Callable | Yes | CLIP image preprocessing function from create_model_and_transforms |
| Name | Type | Description |
|---|---|---|
| vision_x | torch.Tensor | Shape (B, T_img, 1, C, H, W) — preprocessed and batched image tensor ready for Flamingo |
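Because shape mismatches here surface only as opaque errors inside the model, it can help to validate the contract before calling Flamingo. The sketch below uses a hypothetical helper, `check_vision_x_shape`, that is not part of OpenFlamingo; it checks the 6D layout and the F=1 convention for still images against a plain shape tuple.

```python
def check_vision_x_shape(shape):
    """Validate a candidate vision_x shape against the (B, T_img, F, C, H, W)
    contract, with F=1 for still images. `shape` is any sequence of ints,
    e.g. tuple(tensor.shape). Raises ValueError on a contract violation."""
    if len(shape) != 6:
        raise ValueError(
            f"vision_x must be 6D (B, T_img, F, C, H, W), got {len(shape)}D: {tuple(shape)}"
        )
    b, t_img, f, c, h, w = shape
    if f != 1:
        raise ValueError(f"frame dimension F must be 1 for still images, got F={f}")
    if c != 3:
        raise ValueError(f"expected 3 channels (RGB), got C={c}")
```

For example, `check_vision_x_shape(tuple(vision_x.shape))` passes for a correctly assembled `(1, 3, 1, 3, 224, 224)` tensor and rejects a 4D `(T_img, C, H, W)` stack that was never reshaped.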
Usage Examples
The following example demonstrates preparing 2 demo images and 1 query image for few-shot inference, then assembling the 6D tensor:
from PIL import Image
import torch
from open_flamingo import create_model_and_transforms
# Initialize model and get the image processor
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)
# Load raw images
demo_image_1 = Image.open("demo_cat.jpg")
demo_image_2 = Image.open("demo_dog.jpg")
query_image = Image.open("query_animal.jpg")
# Step 1: Apply CLIP preprocessing to each image individually
demo_1_tensor = image_processor(demo_image_1) # (C, H, W)
demo_2_tensor = image_processor(demo_image_2) # (C, H, W)
query_tensor = image_processor(query_image) # (C, H, W)
# Step 2: Stack along the T_img dimension (3 images in sequence)
all_images = torch.cat([
    demo_1_tensor.unsqueeze(0),
    demo_2_tensor.unsqueeze(0),
    query_tensor.unsqueeze(0),
], dim=0) # (3, C, H, W)
# Step 3: Reshape to 6D format: (B, T_img, F, C, H, W)
# B=1 (single example), T_img=3 (two demos + one query), F=1 (still images)
vision_x = all_images.unsqueeze(0).unsqueeze(2) # (1, 3, 1, C, H, W)
# Now vision_x is ready for model.generate(vision_x=vision_x, ...)
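The text prompt passed alongside vision_x interleaves "&lt;image&gt;" placeholder tokens with text, per the OpenFlamingo README convention, and the number of "&lt;image&gt;" occurrences should match the T_img dimension assembled above. The sketch below is a hypothetical consistency check, not part of OpenFlamingo:

```python
def image_tokens_match_t_img(prompt, t_img, image_token="<image>"):
    """Return True if the number of image placeholder tokens in the prompt
    matches the T_img dimension of vision_x. The "<image>" token string
    follows the OpenFlamingo README convention."""
    return prompt.count(image_token) == t_img

# Example prompt for two demos plus one query (T_img=3)
prompt = (
    "<image>An image of a cat.<|endofchunk|>"
    "<image>An image of a dog.<|endofchunk|>"
    "<image>An image of"
)
assert image_tokens_match_t_img(prompt, 3)
```

Running this check before generation catches the common mistake of editing the prompt (adding or dropping a demo) without rebuilding vision_x to match.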
Related Pages
Principle:Mlfoundations_Open_flamingo_Visual_Input_Preprocessing