Principle: mlfoundations OpenFlamingo Model Initialization
Overview
Architectural pattern that assembles a vision-language model by composing a frozen vision encoder, a frozen language model, and trainable bridging modules (Perceiver resampler + gated cross-attention layers).
Description
Flamingo-style models are initialized by composing two large pretrained backbones — a CLIP vision encoder and a causal language model — together with lightweight, trainable bridging modules. The initialization procedure works as follows:
- A pretrained CLIP vision encoder (e.g. ViT-L/14) is loaded and its weights are frozen. This encoder is responsible for producing per-image visual feature maps.
- A pretrained causal language model (e.g. OPT, LLaMA, or MPT) is loaded and its weights are also frozen. This model provides the text-generation backbone.
- Gated cross-attention layers are injected into the language model's decoder stack at a configurable frequency (every n layers). These layers allow the language model to attend to visual features while generating text.
- A Perceiver resampler module is instantiated to compress the variable-length visual token sequence from CLIP into a fixed-length set of latent visual tokens, reducing the computational cost of cross-attention.
- The injection of cross-attention layers into the language model follows a dynamic mixin pattern: the original decoder layer classes are wrapped or extended at runtime so that each selected layer gains an additional cross-attention sub-layer, without modifying the underlying pretrained model class definition. This keeps the frozen LM architecture intact while adding trainable capacity.
- All backbone weights (vision encoder and language model) remain frozen during training; only the Perceiver resampler and gated cross-attention layers are updated, dramatically reducing the number of trainable parameters.
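The composition above can be sketched in a few lines of PyTorch. This is a hypothetical minimal illustration, not OpenFlamingo's actual code: tiny `nn.Linear` stand-ins replace the real CLIP and LLM backbones, and `GatedCrossAttnWrapper` is an invented name showing the mixin-style injection pattern (wrapping selected frozen decoder layers at runtime rather than editing the pretrained class).

```python
import torch
import torch.nn as nn

class GatedCrossAttnWrapper(nn.Module):
    """Hypothetical mixin-style wrapper: prepends a trainable gated
    cross-attention sub-layer to a frozen, pretrained decoder layer."""
    def __init__(self, decoder_layer: nn.Module, dim: int):
        super().__init__()
        self.decoder_layer = decoder_layer        # frozen original layer
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, initialized at zero

    def forward(self, x, visual_tokens):
        attn_out, _ = self.cross_attn(x, visual_tokens, visual_tokens)
        x = x + torch.tanh(self.gate) * attn_out  # contributes nothing at init
        return self.decoder_layer(x)

dim = 32
# Stand-ins for the pretrained backbones (really CLIP and a causal LM).
vision_encoder = nn.Linear(dim, dim)
decoder_layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(8)])

# Freeze both backbones.
for module in [vision_encoder, *decoder_layers]:
    for p in module.parameters():
        p.requires_grad = False

# Inject cross-attention every n layers without modifying the layer classes.
cross_attn_every_n_layers = 4
for i in range(len(decoder_layers)):
    if i % cross_attn_every_n_layers == 0:
        decoder_layers[i] = GatedCrossAttnWrapper(decoder_layers[i], dim)

trainable = sum(p.numel() for l in decoder_layers
                for p in l.parameters() if p.requires_grad)
frozen = sum(p.numel() for l in decoder_layers
             for p in l.parameters() if not p.requires_grad)
print(trainable, frozen)  # only the bridging modules carry gradients
```

Only the newly injected cross-attention weights and gates report `requires_grad=True`; the backbone parameters stay frozen throughout.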
Usage
Use this principle when building a vision-language model that processes interleaved image-text sequences — for example, few-shot visual question answering, image captioning with in-context examples, or any task where images and text are interspersed in a single input stream.
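Concretely, an interleaved image-text input marks each image position with a special token. The sketch below follows the token conventions from the OpenFlamingo README (`<image>` marks an image position, `<|endofchunk|>` ends each image-text pair); the caption texts are illustrative:

```python
# Few-shot captioning prompt in the OpenFlamingo interleaved format:
# two in-context examples followed by a query image.
prompt = (
    "<image>An image of two cats.<|endofchunk|>"
    "<image>An image of a bathroom sink.<|endofchunk|>"
    "<image>An image of"
)
num_images = prompt.count("<image>")
print(num_images)  # 3
```

At generation time, the model cross-attends to the visual features of the image associated with the most recent `<image>` token preceding each text token.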
Theoretical Basis
The architecture follows a three-stage composition pipeline:
1. Visual Encoding (CLIP)
The frozen CLIP vision encoder processes each input image independently, producing a grid of visual feature tokens. Because CLIP is trained with a contrastive image-text objective, these features already carry semantically rich, language-aligned representations.
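For a ViT-L/14 encoder, the size of this token grid follows directly from the patch size. Assuming the standard 224x224 input resolution:

```python
# Visual token count for a ViT-L/14 encoder at 224x224 input resolution.
image_size, patch_size = 224, 14
num_patches = (image_size // patch_size) ** 2  # a 16 x 16 grid of patches
print(num_patches)  # 256 patch tokens (plus one class token)
```

This variable-length, resolution-dependent sequence is what the Perceiver resampler subsequently compresses.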
2. Perceiver Resampler
The Perceiver resampler takes the variable-length sequence of visual tokens output by CLIP and compresses it into a fixed-length set of latent visual tokens (typically 64). It achieves this through a stack of cross-attention layers in which a learned set of latent queries attends to the visual features. This fixed-length output is critical for efficiency: regardless of image resolution or patch count, the language model always cross-attends to the same number of visual tokens.
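A simplified sketch of this mechanism is below. It keeps only the core idea (learned latents cross-attending to visual tokens); the real module additionally interleaves feed-forward blocks and concatenates the visual features with the latents in the keys and values, which this sketch omits:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: learned latent queries cross-attend to a
    variable-length visual token sequence and emit a fixed number
    of latent tokens."""
    def __init__(self, dim: int, num_latents: int = 64, depth: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, visual_tokens):            # (batch, seq_len, dim)
        b = visual_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            out, _ = attn(latents, visual_tokens, visual_tokens)
            latents = latents + out              # residual update of latents
        return latents                           # (batch, num_latents, dim)

resampler = PerceiverResampler(dim=64)
for seq_len in (50, 257):                        # any input length...
    out = resampler(torch.randn(2, seq_len, 64))
    print(out.shape)                             # ...always (2, 64, 64)
```

Because the output length is fixed by `num_latents`, the cost of each downstream cross-attention layer is independent of the image resolution.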
3. Gated Cross-Attention in the Language Model Decoder
Gated cross-attention layers are interleaved into the frozen language model's decoder stack. The key design choices are:
- Cross-attention frequency parameter (cross_attn_every_n_layers): Rather than adding cross-attention to every decoder layer, cross-attention is inserted every n layers (e.g. every 1, 2, or 4 layers). This controls the trade-off between visual grounding and computational cost.
- Tanh-gated residual connections initialized at zero: Each gated cross-attention layer uses a learnable gating scalar, passed through a tanh activation, that multiplies the cross-attention output before it is added as a residual. Crucially, this gate is initialized to zero, meaning that at the start of training the cross-attention contributes nothing and the language model behaves exactly as if it were unmodified. This ensures stable training initialization and allows the model to gradually learn to incorporate visual information.
- Selective freezing: The original self-attention and feed-forward weights within each decoder layer remain frozen. Only the newly added cross-attention parameters and gating scalars are trainable.
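The zero-initialized tanh gate can be verified numerically. Since tanh(0) = 0, the gated branch is exactly suppressed at initialization, so the residual stream passes through unchanged (`TanhGate` here is a hypothetical minimal module, not OpenFlamingo's code):

```python
import torch
import torch.nn as nn

class TanhGate(nn.Module):
    """Zero-initialized tanh gating on a residual branch."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # gate scalar, init at zero

    def forward(self, residual, branch_out):
        # tanh(0) = 0, so branch_out is fully suppressed at initialization.
        return residual + torch.tanh(self.alpha) * branch_out

gate = TanhGate()
x = torch.randn(4, 16)                 # residual stream (LM hidden states)
branch = torch.randn(4, 16)            # stand-in for cross-attention output
out = gate(x, branch)
print(torch.equal(out, x))             # True: the LM is initially unmodified
```

As training progresses, gradient updates move `alpha` away from zero, smoothly blending visual information into the frozen language model.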
This design enables efficient adaptation of very large language models to multimodal tasks while preserving the original language modeling capabilities and requiring only a fraction of the total parameters to be trained.