Implementation:Mlfoundations Open flamingo Image processor pipeline
Overview
User-defined pattern for preprocessing raw images through the CLIP image processor and assembling them into the 6D tensor format required by OpenFlamingo.
Description
This is a Pattern Doc. The image_processor callable returned by create_model_and_transforms applies CLIP-compatible transforms (resize, center crop, normalize). Users must then stack individual image tensors using torch.cat and reshape to (B, T_img, F, C, H, W) where F=1. The Flamingo._encode_vision_x method expects this exact shape.
The pipeline is not a single function call but rather a multi-step user-side pattern that bridges the gap between raw PIL images and the model's expected input contract. The image_processor handles per-image normalization, while the batching and reshaping into the 6D format is the caller's responsibility.
Usage
After obtaining image_processor from create_model_and_transforms, apply it to each PIL image individually to get a (C, H, W) tensor, then assemble the full batch in the 6D format before passing as vision_x to the model.
Code Reference
Source: Repository https://github.com/mlfoundations/open_flamingo
- File: open_flamingo/src/factory.py, lines L42–46 — image_processor is obtained from open_clip during model construction
- File: open_flamingo/src/flamingo.py, lines L177–200 — _encode_vision_x expects shape (B, T_img, F, C, H, W)
Interface pattern:
# Step 1: Process each PIL image individually
processed = [image_processor(img) for img in pil_images] # each: torch.Tensor of shape (C, H, W)
# Step 2: Stack into a sequence along the T_img dimension
vision_x = torch.cat([t.unsqueeze(0) for t in processed], dim=0) # (T_img, C, H, W)
# Step 3: Reshape to the required 6D format
vision_x = vision_x.unsqueeze(0).unsqueeze(2) # (1, T_img, 1, C, H, W) for a single example
# For batched input: (B, T_img, F, C, H, W) where F=1
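For batches with more than one example, each example's image sequence must share a common T_img. The following is a minimal sketch of one way to assemble such a batch; `batch_vision_x` is a hypothetical helper, not part of OpenFlamingo, and the zero-padding of shorter sequences is an assumption (OpenFlamingo does not prescribe a padding scheme here).

```python
import torch

def batch_vision_x(per_example_images, pad_value=0.0):
    """Assemble a list of per-example image lists into (B, T_img, 1, C, H, W).

    per_example_images: list of lists of (C, H, W) tensors, one inner list per
    example. Examples with fewer images are padded with pad_value to the
    longest sequence in the batch (an assumed convention, see note above).
    """
    max_t = max(len(imgs) for imgs in per_example_images)
    c, h, w = per_example_images[0][0].shape
    batch = []
    for imgs in per_example_images:
        seq = torch.stack(imgs, dim=0)  # (t, C, H, W)
        if seq.shape[0] < max_t:
            # Pad the sequence dimension up to the batch-wide maximum
            pad = torch.full((max_t - seq.shape[0], c, h, w), pad_value)
            seq = torch.cat([seq, pad], dim=0)
        batch.append(seq)
    # Stack examples into B, then insert the F=1 frame dimension
    return torch.stack(batch, dim=0).unsqueeze(2)  # (B, T_img, 1, C, H, W)
```

Note that `torch.stack` over already-processed (C, H, W) tensors is equivalent to the `torch.cat`-of-unsqueezed-tensors idiom shown above.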
Import: The image_processor is returned by create_model_and_transforms(), not imported separately. There is no standalone module to import; it is a closure produced at model initialization time.
I/O Contract
| Name | Type | Required | Description |
|---|---|---|---|
| pil_images | List[PIL.Image] | Yes | Raw images to preprocess |
| image_processor | Callable | Yes | CLIP image preprocessing function from create_model_and_transforms |
| Name | Type | Description |
|---|---|---|
| vision_x | torch.Tensor | Shape (B, T_img, 1, C, H, W) — preprocessed and batched image tensor ready for Flamingo |
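Because shape mismatches here surface only as opaque errors inside the model, it can help to validate the contract before calling Flamingo. The sketch below uses a hypothetical helper, `check_vision_x_shape`, that is not part of OpenFlamingo; it checks the 6D layout and the F=1 convention for still images against a plain shape tuple.

```python
def check_vision_x_shape(shape):
    """Validate a candidate vision_x shape against the (B, T_img, F, C, H, W)
    contract, with F=1 for still images. `shape` is any sequence of ints,
    e.g. tuple(tensor.shape). Raises ValueError on a contract violation."""
    if len(shape) != 6:
        raise ValueError(
            f"vision_x must be 6D (B, T_img, F, C, H, W), got {len(shape)}D: {tuple(shape)}"
        )
    b, t_img, f, c, h, w = shape
    if f != 1:
        raise ValueError(f"frame dimension F must be 1 for still images, got F={f}")
    if c != 3:
        raise ValueError(f"expected 3 channels (RGB), got C={c}")
```

For example, `check_vision_x_shape(tuple(vision_x.shape))` passes for a correctly assembled `(1, 3, 1, 3, 224, 224)` tensor and rejects a 4D `(T_img, C, H, W)` stack that was never reshaped.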
Usage Examples
The following example demonstrates preparing 2 demo images and 1 query image for few-shot inference, then assembling the 6D tensor:
from PIL import Image
import torch
from open_flamingo import create_model_and_transforms
# Initialize model and get the image processor
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)
# Load raw images
demo_image_1 = Image.open("demo_cat.jpg")
demo_image_2 = Image.open("demo_dog.jpg")
query_image = Image.open("query_animal.jpg")
# Step 1: Apply CLIP preprocessing to each image individually
demo_1_tensor = image_processor(demo_image_1) # (C, H, W)
demo_2_tensor = image_processor(demo_image_2) # (C, H, W)
query_tensor = image_processor(query_image) # (C, H, W)
# Step 2: Stack along the T_img dimension (3 images in sequence)
all_images = torch.cat([
    demo_1_tensor.unsqueeze(0),
    demo_2_tensor.unsqueeze(0),
    query_tensor.unsqueeze(0),
], dim=0) # (3, C, H, W)
# Step 3: Reshape to 6D format: (B, T_img, F, C, H, W)
# B=1 (single example), T_img=3 (two demos + one query), F=1 (still images)
vision_x = all_images.unsqueeze(0).unsqueeze(2) # (1, 3, 1, C, H, W)
# Now vision_x is ready for model.generate(vision_x=vision_x, ...)
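The text prompt passed alongside vision_x interleaves "&lt;image&gt;" placeholder tokens with text, per the OpenFlamingo README convention, and the number of "&lt;image&gt;" occurrences should match the T_img dimension assembled above. The sketch below is a hypothetical consistency check, not part of OpenFlamingo:

```python
def image_tokens_match_t_img(prompt, t_img, image_token="<image>"):
    """Return True if the number of image placeholder tokens in the prompt
    matches the T_img dimension of vision_x. The "<image>" token string
    follows the OpenFlamingo README convention."""
    return prompt.count(image_token) == t_img

# Example prompt for two demos plus one query (T_img=3)
prompt = (
    "<image>An image of a cat.<|endofchunk|>"
    "<image>An image of a dog.<|endofchunk|>"
    "<image>An image of"
)
assert image_tokens_match_t_img(prompt, 3)
```

Running this check before generation catches the common mistake of editing the prompt (adding or dropping a demo) without rebuilding vision_x to match.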
Related Pages
Principle:Mlfoundations_Open_flamingo_Visual_Input_Preprocessing