Implementation: mlfoundations/open_flamingo `create_model_and_transforms`
Overview
Concrete tool for assembling a Flamingo vision-language model from pretrained CLIP and language model components provided by the OpenFlamingo library.
Description
The create_model_and_transforms function is the primary user-facing factory function in the OpenFlamingo library. It orchestrates the full model assembly pipeline:
- Loads a CLIP vision encoder via the `open_clip` library, using the specified architecture name (e.g. `"ViT-L-14"`) and pretrained weights (e.g. `"openai"`). The corresponding image preprocessing transform is also obtained at this step.
- Loads a HuggingFace causal language model from the specified model path (e.g. `"facebook/opt-1.3b"`) using `AutoModelForCausalLM`.
- Creates a Flamingo model that composes the vision encoder and language model with a Perceiver resampler and gated cross-attention layers. The cross-attention layers are injected into the language model decoder at a frequency controlled by `cross_attn_every_n_layers`.
- Adds special tokens to the tokenizer: `<image>` (marking image positions in the input), `<|endofchunk|>` (delimiting few-shot examples), and `<PAD>` (padding token). The language model's token embeddings are resized accordingly.
- Freezes backbone weights: both the vision encoder and the language model parameters are set to `requires_grad=False`, so that only the Perceiver resampler and gated cross-attention layers are trainable.
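The freeze-then-unfreeze pattern described above can be sketched with a toy `torch` module (this is an illustration of the pattern, not the OpenFlamingo source; `ToyFlamingo` and its submodule names are stand-ins):

```python
import torch.nn as nn

# Toy stand-in for the Flamingo composition: only the bridging modules
# (Perceiver resampler, gated cross-attention) should remain trainable.
class ToyFlamingo(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)    # stands in for the CLIP ViT
        self.lang_encoder = nn.Linear(8, 8)      # stands in for the causal LM
        self.perceiver = nn.Linear(8, 8)         # stands in for the resampler
        self.gated_cross_attn = nn.Linear(8, 8)  # stands in for cross-attention

model = ToyFlamingo()
model.requires_grad_(False)                  # freeze everything first...
model.perceiver.requires_grad_(True)         # ...then re-enable the bridges
model.gated_cross_attn.requires_grad_(True)

trainable = {n for n, p in model.named_parameters() if p.requires_grad}
```

Only the bridging modules' parameters end up in `trainable`; the backbone weights stay frozen and receive no gradient updates during training.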
Usage
Import this function when initializing an OpenFlamingo model for training or inference. It is the single entry point for constructing the complete model, image processor, and tokenizer.
Code Reference
- Source Location: Repository https://github.com/mlfoundations/open_flamingo, File: `open_flamingo/src/factory.py`, Lines 11-119
- Signature:
```python
def create_model_and_transforms(
    clip_vision_encoder_path: str,
    clip_vision_encoder_pretrained: str,
    lang_encoder_path: str,
    tokenizer_path: str,
    cross_attn_every_n_layers: int = 1,
    use_local_files: bool = False,
    decoder_layers_attr_name: str = None,
    freeze_lm_embeddings: bool = False,
    cache_dir: Optional[str] = None,
    **flamingo_kwargs,
) -> Tuple[Flamingo, Callable, PreTrainedTokenizer]
```
- Import:

```python
from open_flamingo import create_model_and_transforms
```
I/O Contract
Inputs
| Parameter | Type | Required | Description |
|---|---|---|---|
| `clip_vision_encoder_path` | `str` | Yes | CLIP model name, e.g. `"ViT-L-14"` |
| `clip_vision_encoder_pretrained` | `str` | Yes | Pretrained dataset for the CLIP model, e.g. `"openai"` |
| `lang_encoder_path` | `str` | Yes | HuggingFace language model path, e.g. `"facebook/opt-1.3b"` |
| `tokenizer_path` | `str` | Yes | HuggingFace tokenizer path |
| `cross_attn_every_n_layers` | `int` | No | Frequency of cross-attention layer injection (default 1) |
| `use_local_files` | `bool` | No | Use local files instead of downloading from remote (default False) |
| `decoder_layers_attr_name` | `str` | No | Name of the decoder layers attribute on the language model; auto-inferred if None |
| `freeze_lm_embeddings` | `bool` | No | Whether to freeze LM input embeddings (default False) |
| `cache_dir` | `str` | No | Cache directory for model downloads |
Outputs
| Name | Type | Description |
|---|---|---|
| `model` | `Flamingo` | The assembled Flamingo model with frozen backbones and trainable bridging modules |
| `image_processor` | `Callable` | CLIP image preprocessing pipeline (resize, normalize, etc.) |
| `tokenizer` | `PreTrainedTokenizer` | Tokenizer with the added special tokens (`<image>`, `<|endofchunk|>`, `<PAD>`) |
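To make the `cross_attn_every_n_layers` parameter concrete: in OpenFlamingo, a gated cross-attention block is attached to decoder layer `i` when `(i + 1) % n == 0`. The helper below is a hypothetical illustration of that indexing convention, not part of the library:

```python
# Hypothetical helper (not in open_flamingo): which decoder layers receive a
# gated cross-attention block for a given cross_attn_every_n_layers value n.
def cross_attn_layer_indices(num_decoder_layers: int, n: int) -> list:
    # A block accompanies every n-th layer, counting from 1.
    return [i for i in range(num_decoder_layers) if (i + 1) % n == 0]
```

For a 24-layer decoder with `n=4`, blocks land at 0-indexed layers 3, 7, 11, 15, 19, and 23; with the default `n=1`, every decoder layer gets one, at the cost of more trainable parameters.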
Usage Examples
Basic model creation:
```python
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="facebook/opt-1.3b",
    tokenizer_path="facebook/opt-1.3b",
    cross_attn_every_n_layers=1,
)
```
Inference setup with a pretrained checkpoint:
```python
import torch
from open_flamingo import create_model_and_transforms
from huggingface_hub import hf_hub_download
from PIL import Image

# Create the model architecture. The language model must match the one the
# checkpoint was trained with: OpenFlamingo-3B-vitl-mpt1b pairs a ViT-L/14
# vision encoder with MPT-1B, not OPT.
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Load pretrained weights
checkpoint_path = hf_hub_download(
    "openflamingo/OpenFlamingo-3B-vitl-mpt1b",
    "checkpoint.pt",
)
model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"), strict=False)
model.eval()

# Prepare inputs
image = Image.open("example.jpg")
vision_x = image_processor(image).unsqueeze(0).unsqueeze(1).unsqueeze(0)
# shape: (batch, num_media, num_frames, channels, height, width)

tokenizer.padding_side = "left"
lang_x = tokenizer(
    ["<image>An image of"],
    return_tensors="pt",
)

# Generate
generated_text = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)
print(tokenizer.decode(generated_text[0]))
```
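The special tokens added by the factory are what make interleaved few-shot prompting work: each in-context example is an `<image>` slot followed by its caption and closed with `<|endofchunk|>`. A hypothetical helper (not part of `open_flamingo`) sketching that prompt layout:

```python
# Hypothetical helper (not in open_flamingo) building an interleaved few-shot
# prompt: <image> marks an image slot, <|endofchunk|> closes each example.
def build_fewshot_prompt(demonstrations, query):
    parts = [f"<image>{caption}<|endofchunk|>" for caption in demonstrations]
    parts.append(f"<image>{query}")  # the query is left open for generation
    return "".join(parts)

prompt = build_fewshot_prompt(
    ["An image of two cats.", "An image of a bathroom sink."],
    "An image of",
)
```

The resulting string is tokenized exactly like the single-image prompt above, while `vision_x` must then carry one image tensor per `<image>` token along its `num_media` dimension.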