Principle:Mlfoundations Open flamingo Visual Input Preprocessing
Overview
Preprocessing pattern that transforms raw images into normalized tensor batches with the specific shape convention required by vision-language models that process multiple images per sequence.
Description
Raw PIL images must be transformed through a CLIP-compatible image processor that applies a sequence of standard vision transforms: resize to match the expected spatial resolution, center crop to a square aspect ratio, and normalize using the CLIP channel-wise mean and standard deviation values. Each processed image becomes a 3-dimensional tensor of shape (C, H, W).
These individual tensors are then assembled into a 6-dimensional tensor with shape (B, T_img, F, C, H, W) where:
- B = batch size (number of independent examples)
- T_img = number of images per text sequence (supports variable-length image contexts, e.g., few-shot demonstrations)
- F = number of frames per image (set to 1 for still images; reserved for video support)
- C = channels (3 for RGB)
- H = height (after resize and crop)
- W = width (after resize and crop)
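Assembling the 6D layout from individually processed images is a matter of stacking along T_img and adding singleton frame and batch axes. A minimal sketch (random tensors stand in for preprocessed images):

```python
import torch

# Four processed images, e.g. three few-shot demonstrations plus one query,
# each already in (C, H, W) form after the CLIP-style transform.
imgs = [torch.randn(3, 224, 224) for _ in range(4)]

vision_x = torch.stack(imgs, dim=0)  # (T_img, C, H, W)
vision_x = vision_x.unsqueeze(1)     # (T_img, F=1, C, H, W) — still images use one frame
vision_x = vision_x.unsqueeze(0)     # (B=1, T_img, F, C, H, W)
```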
This shape convention allows the model to handle a variable number of images per text sequence, which is essential for in-context few-shot learning where demonstration image-text pairs precede the query image. The T_img dimension is what distinguishes this format from a standard image batch and enables the Perceiver resampler and gated cross-attention layers to align each image with its corresponding position in the interleaved text.
Usage
This preprocessing principle applies whenever preparing images for OpenFlamingo inference or training. It is required before any call to model.forward() or model.generate(). Every image input to the model must pass through this transformation pipeline and be reshaped into the 6D convention before being supplied as the vision_x argument.
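When batching examples with different numbers of images, each example must be padded to a common T_img before forming the vision_x tensor. The zero-padding convention below is an assumption for illustration; the helper name `batch_vision_x` is hypothetical:

```python
import torch

def batch_vision_x(examples):
    """Pad variable-length image lists to a shared T_img and return
    a (B, T_img, F=1, C, H, W) tensor. Zero-padding is an assumed convention."""
    t_max = max(len(ex) for ex in examples)
    c, h, w = examples[0][0].shape
    out = torch.zeros(len(examples), t_max, 1, c, h, w)
    for b, ex in enumerate(examples):
        for t, img in enumerate(ex):
            out[b, t, 0] = img
    return out

# One example with three images (few-shot context), one with a single image.
ex_a = [torch.randn(3, 224, 224) for _ in range(3)]
ex_b = [torch.randn(3, 224, 224)]
vision_x = batch_vision_x([ex_a, ex_b])  # (2, 3, 1, 3, 224, 224)
```

The resulting tensor is what would be supplied as the vision_x argument to model.forward() or model.generate().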
Theoretical Basis
CLIP vision encoders expect normalized tensors as input, with pixel values scaled and shifted according to the statistics of the CLIP training distribution. Feeding raw or improperly normalized pixel data produces degraded or meaningless visual embeddings.
The 6D tensor convention (B, T_img, F, C, H, W) extends the standard 4D image batch format (B, C, H, W) to support two additional axes:
- Multiple images per sequence (the T_img dimension) — enabling the model to receive several images interleaved with text tokens within a single forward pass, which is the foundation of Flamingo-style few-shot prompting.
- Video frames (the F dimension, currently F = 1 for still images) — providing a reserved axis for temporal frame data without requiring a change to the tensor contract.
This factored representation allows the Perceiver resampler to process each image independently along the T_img axis, producing a fixed number of visual tokens per image. These visual tokens are then injected into the language model via gated cross-attention fusion layers at designated positions in the text sequence. The separation of T_img and F ensures that spatial and temporal dimensions remain decoupled throughout the vision encoding pipeline.
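The independent per-image processing described above amounts to flattening the leading axes into a plain 4D batch for the vision encoder, then regrouping by sequence position. A shape-level sketch, with hypothetical token counts (64 visual tokens of width 1024 per image stand in for the resampler's output):

```python
import torch

B, T_img, F = 2, 3, 1
vision_x = torch.randn(B, T_img, F, 3, 224, 224)

# Flatten (B, T_img, F) so the vision encoder sees an ordinary image batch.
flat = vision_x.view(B * T_img * F, 3, 224, 224)

# Stand-in for encoder + Perceiver resampler output: a fixed number of
# visual tokens per image (64 tokens of width 1024 is an assumed size).
tokens = torch.randn(flat.shape[0], 64, 1024)

# Regroup tokens by batch element and sequence position for cross-attention.
tokens = tokens.view(B, T_img, F * 64, 1024)
```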
Related Pages
Implementation:Mlfoundations_Open_flamingo_Image_processor_pipeline