
Implementation:Mlfoundations Open flamingo Create model and transforms

From Leeroopedia



Overview

Concrete tool for assembling a Flamingo vision-language model from pretrained CLIP and language model components provided by the OpenFlamingo library.

Description

The create_model_and_transforms function is the primary user-facing factory function in the OpenFlamingo library. It orchestrates the full model assembly pipeline:

  1. Loads a CLIP vision encoder via the open_clip library, using the specified architecture name (e.g. "ViT-L-14") and pretrained weights (e.g. "openai"). The corresponding image preprocessing transform is also obtained at this step.
  2. Loads a HuggingFace causal language model from the specified model path (e.g. "facebook/opt-1.3b") using AutoModelForCausalLM.
  3. Creates a Flamingo model that composes the vision encoder and language model with a Perceiver resampler and gated cross-attention layers. The cross-attention layers are injected into the language model decoder at a frequency controlled by cross_attn_every_n_layers.
  4. Adds special tokens to the tokenizer: <image> (marking image positions in the input), <|endofchunk|> (delimiting few-shot examples), and <PAD> (padding token). The language model's token embeddings are resized accordingly.
  5. Freezes backbone weights — both the vision encoder and the language model parameters are set to requires_grad=False, so that only the Perceiver resampler and gated cross-attention layers are trainable.
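The freezing pattern in step 5 can be sketched with toy modules (the names below are illustrative stand-ins, not the library's internals):

```python
import torch.nn as nn

# Toy stand-ins for the real components (illustrative only)
vision_encoder = nn.Linear(8, 8)   # stands in for the CLIP ViT
lang_encoder = nn.Linear(8, 8)     # stands in for the HF causal LM
bridge = nn.Linear(8, 8)           # stands in for Perceiver + cross-attn

# Freeze the backbones, as create_model_and_transforms does
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in lang_encoder.parameters():
    p.requires_grad = False

# Only the bridging module's parameters remain trainable
trainable = [p for m in (vision_encoder, lang_encoder, bridge)
             for p in m.parameters() if p.requires_grad]
print(len(trainable))  # -> 2 (the bridge's weight and bias)
```

This is why OpenFlamingo training checkpoints are much smaller than the full model: only the Perceiver resampler and gated cross-attention weights (plus, optionally, the LM embeddings) need to be saved and optimized.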

Usage

Import this function when initializing an OpenFlamingo model for training or inference. It is the single entry point for constructing the complete model, image processor, and tokenizer.

Code Reference

def create_model_and_transforms(
    clip_vision_encoder_path: str,
    clip_vision_encoder_pretrained: str,
    lang_encoder_path: str,
    tokenizer_path: str,
    cross_attn_every_n_layers: int = 1,
    use_local_files: bool = False,
    decoder_layers_attr_name: str = None,
    freeze_lm_embeddings: bool = False,
    cache_dir: Optional[str] = None,
    **flamingo_kwargs,
) -> Tuple[Flamingo, Callable, PreTrainedTokenizer]
  • Import:
from open_flamingo import create_model_and_transforms

I/O Contract

Inputs

Parameter Type Required Description
clip_vision_encoder_path str Yes CLIP model name, e.g. "ViT-L-14"
clip_vision_encoder_pretrained str Yes Pretrained dataset for the CLIP model, e.g. "openai"
lang_encoder_path str Yes HuggingFace language model path, e.g. "facebook/opt-1.3b"
tokenizer_path str Yes HuggingFace tokenizer path
cross_attn_every_n_layers int No Frequency of cross-attention layer injection (default 1)
use_local_files bool No Use local files instead of downloading from remote (default False)
decoder_layers_attr_name str No Name of the decoder layers attribute on the language model; auto-inferred if None
freeze_lm_embeddings bool No Whether to freeze LM input embeddings (default False)
cache_dir str No Cache directory for model downloads
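To see what cross_attn_every_n_layers controls, the pure-Python sketch below computes which decoder layers receive a gated cross-attention block. The `(layer_idx + 1) % n == 0` placement rule is paraphrased from the OpenFlamingo source; treat the exact indexing as an assumption, and the helper name as hypothetical:

```python
def cross_attn_layer_indices(num_decoder_layers: int, every_n: int) -> list:
    """0-based decoder layer indices that get a gated cross-attention
    block, assuming the (layer_idx + 1) % every_n == 0 placement rule."""
    return [i for i in range(num_decoder_layers) if (i + 1) % every_n == 0]

# A 24-layer decoder with cross_attn_every_n_layers=4:
print(cross_attn_layer_indices(24, 4))
# -> [3, 7, 11, 15, 19, 23]

# The default (every_n=1) attaches cross-attention to every layer:
print(cross_attn_layer_indices(24, 1) == list(range(24)))  # -> True
```

Larger values of cross_attn_every_n_layers reduce the number of trainable bridging layers, trading model capacity for memory and compute.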

Outputs

Name Type Description
model Flamingo The assembled Flamingo model with frozen backbones and trainable bridging modules
image_processor Callable CLIP image preprocessing pipeline (resize, normalize, etc.)
tokenizer PreTrainedTokenizer Tokenizer with the added special tokens (<image>, <|endofchunk|>, <PAD>)
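A few-shot prompt for the returned tokenizer interleaves the special tokens added in step 4: each demonstration is prefixed with <image> and terminated with <|endofchunk|>, and the query is left open-ended. A minimal sketch of prompt assembly (the helper name is hypothetical):

```python
def build_prompt(demo_captions, query_prefix="An image of"):
    """Assemble an interleaved few-shot prompt: each demonstration gets
    an <image> marker and an <|endofchunk|> delimiter; the final query
    image is left open for the model to complete."""
    demos = "".join(f"<image>{cap}<|endofchunk|>" for cap in demo_captions)
    return demos + f"<image>{query_prefix}"

prompt = build_prompt(["A photo of a cat.", "A photo of a dog."])
print(prompt)
# -> <image>A photo of a cat.<|endofchunk|><image>A photo of a dog.<|endofchunk|><image>An image of
```

The number of <image> tokens in the prompt must match the num_media dimension of the vision_x tensor passed to the model.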

Usage Examples

Basic model creation:

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="facebook/opt-1.3b",
    tokenizer_path="facebook/opt-1.3b",
    cross_attn_every_n_layers=1,
)

Inference setup with a pretrained checkpoint:

import torch
from open_flamingo import create_model_and_transforms
from huggingface_hub import hf_hub_download

# Create the model architecture
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="facebook/opt-1.3b",
    tokenizer_path="facebook/opt-1.3b",
    cross_attn_every_n_layers=1,
)

# Load pretrained weights
checkpoint_path = hf_hub_download(
    "openflamingo/OpenFlamingo-3B-vitl-mpt1b",
    "checkpoint.pt",
)
model.load_state_dict(torch.load(checkpoint_path), strict=False)
model.eval()

# Prepare inputs
from PIL import Image
image = Image.open("example.jpg")
vision_x = image_processor(image).unsqueeze(0).unsqueeze(1).unsqueeze(0)
# shape: (batch, num_media, num_frames, channels, height, width)

tokenizer.padding_side = "left"
lang_x = tokenizer(
    ["<image>An image of"],
    return_tensors="pt",
)

# Generate
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated_text[0]))
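The inference example above handles a single image; for prompts containing several images, each preprocessed image is stacked along the num_media dimension to produce the (batch, num_media, num_frames, channels, height, width) layout. A sketch using random tensors as stand-ins for image_processor outputs:

```python
import torch

# Stand-ins for image_processor outputs, each (channels, height, width)
imgs = [torch.randn(3, 224, 224) for _ in range(3)]

# Stack along a new num_media axis, then add num_frames and batch axes
vision_x = torch.stack(imgs, dim=0)   # (num_media=3, C, H, W)
vision_x = vision_x.unsqueeze(1)      # (3, num_frames=1, C, H, W)
vision_x = vision_x.unsqueeze(0)      # (batch=1, 3, 1, C, H, W)
print(vision_x.shape)  # -> torch.Size([1, 3, 1, 3, 224, 224])
```

The num_frames axis exists because Flamingo treats each image as a single-frame video; for still images it is always 1.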
