
Implementation:Mlfoundations Open flamingo Create model and transforms

From Leeroopedia



Overview

Concrete tool for assembling a Flamingo vision-language model from pretrained CLIP and language model components provided by the OpenFlamingo library.

Description

The create_model_and_transforms function is the primary user-facing factory function in the OpenFlamingo library. It orchestrates the full model assembly pipeline:

  1. Loads a CLIP vision encoder via the open_clip library, using the specified architecture name (e.g. "ViT-L-14") and pretrained weights (e.g. "openai"). The corresponding image preprocessing transform is also obtained at this step.
  2. Loads a HuggingFace causal language model from the specified model path (e.g. "facebook/opt-1.3b") using AutoModelForCausalLM.
  3. Creates a Flamingo model that composes the vision encoder and language model with a Perceiver resampler and gated cross-attention layers. The cross-attention layers are injected into the language model decoder at a frequency controlled by cross_attn_every_n_layers.
  4. Adds special tokens to the tokenizer: <image> (marking image positions in the input), <|endofchunk|> (delimiting few-shot examples), and <PAD> (padding token). The language model's token embeddings are resized accordingly.
  5. Freezes backbone weights — both the vision encoder and the language model parameters are set to requires_grad=False, so that only the Perceiver resampler and gated cross-attention layers are trainable.
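The freezing pattern in step 5 can be sketched with toy modules (the names below are illustrative stand-ins, not the library's internals):

```python
import torch.nn as nn

# Toy stand-ins for the real components (illustrative only)
vision_encoder = nn.Linear(8, 8)   # stands in for the CLIP ViT
lang_encoder = nn.Linear(8, 8)     # stands in for the HF causal LM
bridge = nn.Linear(8, 8)           # stands in for Perceiver + cross-attn

# Freeze the backbones, as create_model_and_transforms does
for p in vision_encoder.parameters():
    p.requires_grad = False
for p in lang_encoder.parameters():
    p.requires_grad = False

# Only the bridging module's parameters remain trainable
trainable = [p for m in (vision_encoder, lang_encoder, bridge)
             for p in m.parameters() if p.requires_grad]
print(len(trainable))  # -> 2 (the bridge's weight and bias)
```

This is why OpenFlamingo training checkpoints are much smaller than the full model: only the Perceiver resampler and gated cross-attention weights (plus, optionally, the LM embeddings) need to be saved and optimized.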

Usage

Import this function when initializing an OpenFlamingo model for training or inference. It is the single entry point for constructing the complete model, image processor, and tokenizer.

Code Reference

def create_model_and_transforms(
    clip_vision_encoder_path: str,
    clip_vision_encoder_pretrained: str,
    lang_encoder_path: str,
    tokenizer_path: str,
    cross_attn_every_n_layers: int = 1,
    use_local_files: bool = False,
    decoder_layers_attr_name: str = None,
    freeze_lm_embeddings: bool = False,
    cache_dir: Optional[str] = None,
    **flamingo_kwargs,
) -> Tuple[Flamingo, Callable, PreTrainedTokenizer]
  • Import:
from open_flamingo import create_model_and_transforms

I/O Contract

Inputs

Parameter Type Required Description
clip_vision_encoder_path str Yes CLIP model name, e.g. "ViT-L-14"
clip_vision_encoder_pretrained str Yes Pretrained dataset for the CLIP model, e.g. "openai"
lang_encoder_path str Yes HuggingFace language model path, e.g. "facebook/opt-1.3b"
tokenizer_path str Yes HuggingFace tokenizer path
cross_attn_every_n_layers int No Frequency of cross-attention layer injection (default 1)
use_local_files bool No Use local files instead of downloading from remote (default False)
decoder_layers_attr_name str No Name of the decoder layers attribute on the language model; auto-inferred if None
freeze_lm_embeddings bool No Whether to freeze LM input embeddings (default False)
cache_dir str No Cache directory for model downloads
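To see what cross_attn_every_n_layers controls, the pure-Python sketch below computes which decoder layers receive a gated cross-attention block. The `(layer_idx + 1) % n == 0` placement rule is paraphrased from the OpenFlamingo source; treat the exact indexing as an assumption, and the helper name as hypothetical:

```python
def cross_attn_layer_indices(num_decoder_layers: int, every_n: int) -> list:
    """0-based decoder layer indices that get a gated cross-attention
    block, assuming the (layer_idx + 1) % every_n == 0 placement rule."""
    return [i for i in range(num_decoder_layers) if (i + 1) % every_n == 0]

# A 24-layer decoder with cross_attn_every_n_layers=4:
print(cross_attn_layer_indices(24, 4))
# -> [3, 7, 11, 15, 19, 23]

# The default (every_n=1) attaches cross-attention to every layer:
print(cross_attn_layer_indices(24, 1) == list(range(24)))  # -> True
```

Larger values of cross_attn_every_n_layers reduce the number of trainable bridging layers, trading model capacity for memory and compute.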

Outputs

Name Type Description
model Flamingo The assembled Flamingo model with frozen backbones and trainable bridging modules
image_processor Callable CLIP image preprocessing pipeline (resize, normalize, etc.)
tokenizer PreTrainedTokenizer Tokenizer with the added special tokens (<image>, <|endofchunk|>, <PAD>)
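A few-shot prompt for the returned tokenizer interleaves the special tokens added in step 4: each demonstration is prefixed with <image> and terminated with <|endofchunk|>, and the query is left open-ended. A minimal sketch of prompt assembly (the helper name is hypothetical):

```python
def build_prompt(demo_captions, query_prefix="An image of"):
    """Assemble an interleaved few-shot prompt: each demonstration gets
    an <image> marker and an <|endofchunk|> delimiter; the final query
    image is left open for the model to complete."""
    demos = "".join(f"<image>{cap}<|endofchunk|>" for cap in demo_captions)
    return demos + f"<image>{query_prefix}"

prompt = build_prompt(["A photo of a cat.", "A photo of a dog."])
print(prompt)
# -> <image>A photo of a cat.<|endofchunk|><image>A photo of a dog.<|endofchunk|><image>An image of
```

The number of <image> tokens in the prompt must match the num_media dimension of the vision_x tensor passed to the model.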

Usage Examples

Basic model creation:

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="facebook/opt-1.3b",
    tokenizer_path="facebook/opt-1.3b",
    cross_attn_every_n_layers=1,
)

Inference setup with a pretrained checkpoint:

import torch
from open_flamingo import create_model_and_transforms
from huggingface_hub import hf_hub_download

# Create the model architecture
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="facebook/opt-1.3b",
    tokenizer_path="facebook/opt-1.3b",
    cross_attn_every_n_layers=1,
)

# Load pretrained weights
checkpoint_path = hf_hub_download(
    "openflamingo/OpenFlamingo-3B-vitl-mpt1b",
    "checkpoint.pt",
)
model.load_state_dict(torch.load(checkpoint_path), strict=False)
model.eval()

# Prepare inputs
from PIL import Image
image = Image.open("example.jpg")
vision_x = image_processor(image).unsqueeze(0).unsqueeze(1).unsqueeze(0)
# shape: (batch, num_media, num_frames, channels, height, width)

tokenizer.padding_side = "left"
lang_x = tokenizer(
    ["<image>An image of"],
    return_tensors="pt",
)

# Generate
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated_text[0]))
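The inference example above handles a single image; for prompts containing several images, each preprocessed image is stacked along the num_media dimension to produce the (batch, num_media, num_frames, channels, height, width) layout. A sketch using random tensors as stand-ins for image_processor outputs:

```python
import torch

# Stand-ins for image_processor outputs, each (channels, height, width)
imgs = [torch.randn(3, 224, 224) for _ in range(3)]

# Stack along a new num_media axis, then add num_frames and batch axes
vision_x = torch.stack(imgs, dim=0)   # (num_media=3, C, H, W)
vision_x = vision_x.unsqueeze(1)      # (3, num_frames=1, C, H, W)
vision_x = vision_x.unsqueeze(0)      # (batch=1, 3, 1, C, H, W)
print(vision_x.shape)  # -> torch.Size([1, 3, 1, 3, 224, 224])
```

The num_frames axis exists because Flamingo treats each image as a single-frame video; for still images it is always 1.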
