Workflow:Mlfoundations Open flamingo Model Creation And Inference

Knowledge Sources	OpenFlamingo, OpenFlamingo Paper, Flamingo Paper, HuggingFace Models
Domains	Vision_Language_Models, Inference, Multimodal_AI
Last Updated	2026-02-08 03:30 GMT

Overview

End-to-end process for initializing an OpenFlamingo vision-language model and generating text conditioned on interleaved images and text using few-shot in-context learning.

Description

This workflow covers instantiating an OpenFlamingo model from pretrained components (CLIP vision encoder + HuggingFace causal language model), loading trained checkpoint weights, and performing multimodal text generation. The model architecture fuses visual and language representations through Perceiver Resampler compression and gated cross-attention layers injected into the frozen language model. The workflow supports few-shot image captioning, visual question answering, and other vision-language tasks via in-context examples.

Usage

Execute this workflow when you need to set up an OpenFlamingo model for inference on vision-language tasks such as image captioning, visual question answering, or hateful memes classification. It assumes you have access to a pretrained OpenFlamingo checkpoint (from HuggingFace Hub or local storage) and want to generate text conditioned on one or more input images, optionally preceded by in-context demonstration examples.

Execution Steps

Step 1: Install Dependencies

Install the OpenFlamingo package and its core dependencies. The package requires PyTorch, OpenCLIP for the vision encoder, and HuggingFace Transformers for the language model. Install it with pip, or create the conda environment defined in the repository.

Key considerations:

The base package only requires core dependencies (torch, open_clip, transformers)
Training and evaluation extras can be installed separately
A conda environment definition is also available
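
Once installation finishes, it can help to confirm that the core dependencies are importable. The sketch below is a small standard-library check and is not part of OpenFlamingo; the helper name missing_modules is hypothetical, and the listed import names are assumed to correspond to the packages named above.

```python
from importlib.util import find_spec

# Import names assumed for the core dependencies listed above,
# plus the OpenFlamingo package itself.
CORE_MODULES = ["torch", "open_clip", "transformers", "open_flamingo"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(CORE_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All core dependencies are importable.")
```

If modules are reported missing, installing the package with pip (published on PyPI as open-flamingo) normally pulls them in. The names of the training and evaluation extras should be checked against the project README for the release in use.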

Step 2: Initialize Model And Transforms

Create the OpenFlamingo model by specifying the CLIP vision encoder variant, the pretrained language model path, and the cross-attention injection frequency. The factory function assembles the full architecture: it loads the CLIP vision encoder, loads the HuggingFace causal language model, dynamically injects the FlamingoLMMixin (adding gated cross-attention layers), adds special tokens to the tokenizer, and applies the selective parameter freezing strategy.

Key considerations:

Vision encoder must be a valid OpenCLIP model (e.g., ViT-L-14)
Language model can be MPT, RedPajama, LLaMA, OPT, GPT-Neo, GPT-J, or Pythia
The cross_attn_every_n_layers parameter controls how often cross-attention is applied (must match the checkpoint)
Special tokens <image> and <|endofchunk|> are added to the tokenizer
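
The initialization step can be sketched as follows. The model names and the cross-attention interval in MODEL_CONFIG follow the project's README example for the 3B model and are assumptions for illustration; substitute the values that match your checkpoint. The helper cross_attn_layer_indices is hypothetical and only illustrates what "every n layers" means, assuming a block is attached to every n-th decoder layer counting from one.

```python
# Configuration for a 3B-scale setup. The specific model names follow the
# project's README example and are assumptions, not requirements.
MODEL_CONFIG = {
    "clip_vision_encoder_path": "ViT-L-14",
    "clip_vision_encoder_pretrained": "openai",
    "lang_encoder_path": "anas-awadalla/mpt-1b-redpajama-200b",
    "tokenizer_path": "anas-awadalla/mpt-1b-redpajama-200b",
    "cross_attn_every_n_layers": 1,
}


def cross_attn_layer_indices(num_layers, every_n):
    """Illustrative only: zero-based decoder layers that would receive a
    gated cross-attention block if one is attached to every n-th layer."""
    return [i for i in range(num_layers) if (i + 1) % every_n == 0]


def build_model(config=MODEL_CONFIG):
    """Build the model, image processor, and tokenizer (downloads weights)."""
    # Imported lazily so the helpers above work without the package installed.
    from open_flamingo import create_model_and_transforms

    model, image_processor, tokenizer = create_model_and_transforms(**config)
    return model, image_processor, tokenizer
```

Calling build_model() returns the three objects used in the remaining steps. Because the interval determines which decoder layers carry cross-attention parameters, a checkpoint trained with one interval will not line up with a model built with another.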

Step 3: Load Pretrained Weights

Download and load a pretrained OpenFlamingo checkpoint. Checkpoints are available on HuggingFace Hub for several model sizes (3B, 4B, 9B). The checkpoint contains only the trainable parameters (Perceiver, cross-attention layers, and optionally embeddings), so load it with strict=False to tolerate the frozen backbone weights that are absent from the file.

Key considerations:

Use huggingface_hub to download checkpoints
Load with strict=False since the checkpoint only contains trainable parameters
Ensure the checkpoint matches the model architecture configuration (encoder, LM, cross-attn interval)
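
A minimal sketch of the loading step is shown below. The repository id and filename are taken from the project's README example for the 3B model and are assumptions here. The check on unexpected keys is an addition for illustration, not something the library requires: missing keys are normal with a partial checkpoint, whereas keys the model does not recognize usually point to a configuration mismatch. The helper names apply_checkpoint and load_pretrained are hypothetical.

```python
def apply_checkpoint(model, state_dict):
    """Load a partial state dict and fail loudly on keys the model lacks."""
    result = model.load_state_dict(state_dict, strict=False)
    # Missing keys are expected: frozen backbone weights are not in the file.
    # Unexpected keys suggest the checkpoint was trained with a different
    # architecture configuration (encoder, LM, or cross-attention interval).
    if result.unexpected_keys:
        sample = list(result.unexpected_keys)[:5]
        raise ValueError(f"Checkpoint/model mismatch, e.g. {sample}")
    return result


def load_pretrained(model,
                    repo_id="openflamingo/OpenFlamingo-3B-vitl-mpt1b",
                    filename="checkpoint.pt"):
    """Download a checkpoint from the Hub and load it into `model`."""
    # Imported lazily so apply_checkpoint can be used without these packages.
    import torch
    from huggingface_hub import hf_hub_download

    checkpoint_path = hf_hub_download(repo_id, filename)
    state_dict = torch.load(checkpoint_path, map_location="cpu")
    return apply_checkpoint(model, state_dict)
```

For a checkpoint already on local storage, skip the download and pass the result of torch.load directly to apply_checkpoint.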

Step 4: Prepare Visual Inputs

Process input images through the image processor pipeline returned by the factory function. Images must be shaped as a 6D tensor: (batch_size, num_media, num_frames, channels, height, width). For single-image inputs, num_media and num_frames are both 1. For few-shot prompting, stack demonstration images and the query image along the num_media dimension.

Key considerations:

Each image is processed independently through the CLIP image processor
Images are stacked along the media dimension (num_media, written T_img)
Currently only single-frame inputs are supported (num_frames, written F, must be 1)
For few-shot, include demonstration images before the query image
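
The tensor layout described above can be built as in the following sketch. It assumes the image processor returns a (channels, height, width) tensor per PIL image; the 224x224 default in expected_vision_shape is an assumption that holds for ViT-L-14 and may differ for other encoders. Both helper names are hypothetical.

```python
def expected_vision_shape(num_images, channels=3, height=224, width=224):
    """Shape of vision_x for one sample: (B, T_img, F, C, H, W) with B=1, F=1."""
    return (1, num_images, 1, channels, height, width)


def build_vision_x(images, image_processor):
    """Preprocess PIL images (demonstrations first, query last) into vision_x."""
    import torch  # imported lazily; only needed when real tensors are built

    # Each processed image is (C, H, W); add a leading axis for concatenation.
    processed = [image_processor(img).unsqueeze(0) for img in images]
    vision_x = torch.cat(processed, dim=0)  # (T_img, C, H, W)
    vision_x = vision_x.unsqueeze(1)        # (T_img, F=1, C, H, W)
    vision_x = vision_x.unsqueeze(0)        # (B=1, T_img, F=1, C, H, W)
    return vision_x
```

With two demonstration images and one query image, tuple(build_vision_x(images, image_processor).shape) should equal expected_vision_shape(3) under these assumptions.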

Step 5: Prepare Text Inputs

Construct the text prompt with special tokens marking image locations and chunk boundaries. An <image> token marks the position of each image in the sequence and comes before the text associated with that image. Each image-text pair is terminated with an <|endofchunk|> token. For few-shot captioning, format the demonstration examples as complete image-caption pairs, followed by the query prompt. Set the tokenizer's padding side to left for generation.

Key considerations:

Use <image> to mark where each image appears in the sequence
Use <|endofchunk|> to mark the end of each image-text segment
Set padding_side="left" for generation tasks
The prompt format differs by task (captioning, VQA, classification)
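
For captioning, the prompt layout can be sketched as below. The helper names and the "An image of" prefix are illustrative assumptions; other tasks use different templates, which this sketch does not cover.

```python
IMAGE_TOKEN = "<image>"
END_OF_CHUNK = "<|endofchunk|>"


def build_caption_prompt(demo_captions, query_prefix="An image of"):
    """Join completed demonstration chunks, then an open chunk for the query."""
    chunks = [f"{IMAGE_TOKEN}{caption}{END_OF_CHUNK}" for caption in demo_captions]
    # The query chunk has no end-of-chunk token: the model completes it.
    chunks.append(f"{IMAGE_TOKEN}{query_prefix}")
    return "".join(chunks)


def tokenize_prompt(tokenizer, prompt):
    """Tokenize one prompt with left padding, as needed for generation."""
    tokenizer.padding_side = "left"
    return tokenizer([prompt], return_tensors="pt")


if __name__ == "__main__":
    print(build_caption_prompt(["An image of two cats.",
                                "An image of a bathroom sink."]))
```

The number of <image> tokens in the prompt should equal the number of images stacked along the media dimension, in the same order.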

Step 6: Generate Text

Call the model's generate method with the prepared vision and language tensors. The method encodes vision inputs through CLIP and the Perceiver Resampler, conditions the language model's cross-attention layers on the visual features, then runs autoregressive generation. Configure generation parameters such as beam search width, maximum tokens, and temperature.

Key considerations:

The generate method caches vision features for efficient autoregressive decoding
Beam search is supported (vision inputs are replicated across beams)
The end-of-chunk token serves as the EOS token during generation
Decode the output tensor using the tokenizer to get the final text
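
The generation call and decoding can be sketched as follows. The generation settings are arbitrary example values. The sketch assumes the returned sequence begins with the prompt tokens (as HuggingFace generation does for decoder-only models), so the prompt is sliced off before decoding; generate_text is a hypothetical helper name.

```python
END_OF_CHUNK = "<|endofchunk|>"


def generate_text(model, tokenizer, vision_x, lang_x,
                  max_new_tokens=20, num_beams=3):
    """Generate a completion for a single prompt and return it as clean text."""
    output = model.generate(
        vision_x=vision_x,
        lang_x=lang_x["input_ids"],
        attention_mask=lang_x["attention_mask"],
        max_new_tokens=max_new_tokens,
        num_beams=num_beams,
    )
    # Assumption: the output row starts with the prompt tokens, so drop them.
    prompt_len = len(lang_x["input_ids"][0])
    new_tokens = output[0][prompt_len:]
    decoded = tokenizer.decode(new_tokens)
    # Cut at the end-of-chunk token, which acts as EOS during generation.
    return decoded.split(END_OF_CHUNK, 1)[0].strip()
```

Here vision_x and lang_x are the tensors prepared in Steps 4 and 5. For batched prompts, apply the same slicing and decoding to each row of the output.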

GitHub URL

Workflow Repository