Principle: mlfoundations OpenFlamingo Model Initialization
Overview
Architectural pattern that assembles a vision-language model by composing a frozen vision encoder, a frozen language model, and trainable bridging modules (Perceiver resampler + gated cross-attention layers).
Description
Flamingo-style models are initialized by composing two large pretrained backbones — a CLIP vision encoder and a causal language model — together with lightweight, trainable bridging modules. The initialization procedure works as follows:
- A pretrained CLIP vision encoder (e.g. ViT-L/14) is loaded and its weights are frozen. This encoder is responsible for producing per-image visual feature maps.
- A pretrained causal language model (e.g. OPT, LLaMA, or MPT) is loaded and its weights are also frozen. This model provides the text-generation backbone.
- Gated cross-attention layers are injected into the language model's decoder stack at a configurable frequency (every n layers). These layers allow the language model to attend to visual features while generating text.
- A Perceiver resampler module is instantiated to compress the variable-length visual token sequence from CLIP into a fixed-length set of latent visual tokens, reducing the computational cost of cross-attention.
- The injection of cross-attention layers into the language model follows a dynamic mixin pattern: the original decoder layer classes are wrapped or extended at runtime so that each selected layer gains an additional cross-attention sub-layer, without modifying the underlying pretrained model class definition. This keeps the frozen LM architecture intact while adding trainable capacity.
- All backbone weights (vision encoder and language model) remain frozen during training; only the Perceiver resampler and gated cross-attention layers are updated, dramatically reducing the number of trainable parameters.
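The composition above can be sketched in a few lines of PyTorch. This is a hypothetical minimal illustration, not OpenFlamingo's actual code: tiny `nn.Linear` stand-ins replace the real CLIP and LLM backbones, and `GatedCrossAttnWrapper` is an invented name showing the mixin-style injection pattern (wrapping selected frozen decoder layers at runtime rather than editing the pretrained class).

```python
import torch
import torch.nn as nn

class GatedCrossAttnWrapper(nn.Module):
    """Hypothetical mixin-style wrapper: prepends a trainable gated
    cross-attention sub-layer to a frozen, pretrained decoder layer."""
    def __init__(self, decoder_layer: nn.Module, dim: int):
        super().__init__()
        self.decoder_layer = decoder_layer        # frozen original layer
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh gate, initialized at zero

    def forward(self, x, visual_tokens):
        attn_out, _ = self.cross_attn(x, visual_tokens, visual_tokens)
        x = x + torch.tanh(self.gate) * attn_out  # contributes nothing at init
        return self.decoder_layer(x)

dim = 32
# Stand-ins for the pretrained backbones (really CLIP and a causal LM).
vision_encoder = nn.Linear(dim, dim)
decoder_layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(8)])

# Freeze both backbones.
for module in [vision_encoder, *decoder_layers]:
    for p in module.parameters():
        p.requires_grad = False

# Inject cross-attention every n layers without modifying the layer classes.
cross_attn_every_n_layers = 4
for i in range(len(decoder_layers)):
    if i % cross_attn_every_n_layers == 0:
        decoder_layers[i] = GatedCrossAttnWrapper(decoder_layers[i], dim)

trainable = sum(p.numel() for l in decoder_layers
                for p in l.parameters() if p.requires_grad)
frozen = sum(p.numel() for l in decoder_layers
             for p in l.parameters() if not p.requires_grad)
print(trainable, frozen)  # only the bridging modules carry gradients
```

Only the newly injected cross-attention weights and gates report `requires_grad=True`; the backbone parameters stay frozen throughout.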
Usage
Use this principle when building a vision-language model that processes interleaved image-text sequences — for example, few-shot visual question answering, image captioning with in-context examples, or any task where images and text are interspersed in a single input stream.
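Concretely, an interleaved image-text input marks each image position with a special token. The sketch below follows the token conventions from the OpenFlamingo README (`<image>` marks an image position, `<|endofchunk|>` ends each image-text pair); the caption texts are illustrative:

```python
# Few-shot captioning prompt in the OpenFlamingo interleaved format:
# two in-context examples followed by a query image.
prompt = (
    "<image>An image of two cats.<|endofchunk|>"
    "<image>An image of a bathroom sink.<|endofchunk|>"
    "<image>An image of"
)
num_images = prompt.count("<image>")
print(num_images)  # 3
```

At generation time, the model cross-attends to the visual features of the image associated with the most recent `<image>` token preceding each text token.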
Theoretical Basis
The architecture follows a three-stage composition pipeline:
1. Visual Encoding (CLIP)
The frozen CLIP vision encoder processes each input image independently, producing a grid of visual feature tokens. Because CLIP is trained with a contrastive image-text objective, these features already carry semantically rich, language-aligned representations.
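For a ViT-L/14 encoder, the size of this token grid follows directly from the patch size. Assuming the standard 224x224 input resolution:

```python
# Visual token count for a ViT-L/14 encoder at 224x224 input resolution.
image_size, patch_size = 224, 14
num_patches = (image_size // patch_size) ** 2  # a 16 x 16 grid of patches
print(num_patches)  # 256 patch tokens (plus one class token)
```

This variable-length, resolution-dependent sequence is what the Perceiver resampler subsequently compresses.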
2. Perceiver Resampler
The Perceiver resampler takes the variable-length sequence of visual tokens output by CLIP and compresses it into a fixed-length set of latent visual tokens (typically 64). It achieves this through a stack of cross-attention layers in which a learned set of latent queries attends to the visual features. This fixed-length output is critical for efficiency: regardless of image resolution or patch count, the language model always cross-attends to the same number of visual tokens.
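A simplified sketch of this mechanism is below. It keeps only the core idea (learned latents cross-attending to visual tokens); the real module additionally interleaves feed-forward blocks and concatenates the visual features with the latents in the keys and values, which this sketch omits:

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Minimal sketch: learned latent queries cross-attend to a
    variable-length visual token sequence and emit a fixed number
    of latent tokens."""
    def __init__(self, dim: int, num_latents: int = 64, depth: int = 2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, visual_tokens):            # (batch, seq_len, dim)
        b = visual_tokens.shape[0]
        latents = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn in self.layers:
            out, _ = attn(latents, visual_tokens, visual_tokens)
            latents = latents + out              # residual update of latents
        return latents                           # (batch, num_latents, dim)

resampler = PerceiverResampler(dim=64)
for seq_len in (50, 257):                        # any input length...
    out = resampler(torch.randn(2, seq_len, 64))
    print(out.shape)                             # ...always (2, 64, 64)
```

Because the output length is fixed by `num_latents`, the cost of each downstream cross-attention layer is independent of the image resolution.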
3. Gated Cross-Attention in the Language Model Decoder
Gated cross-attention layers are interleaved into the frozen language model's decoder stack. The key design choices are:
- Cross-attention frequency parameter (cross_attn_every_n_layers): Rather than adding cross-attention to every decoder layer, cross-attention is inserted every n layers (e.g. every 1, 2, or 4 layers). This controls the trade-off between visual grounding and computational cost.
- Tanh-gated residual connections initialized at zero: Each gated cross-attention layer uses a learnable gating scalar, passed through a tanh activation, that multiplies the cross-attention output before it is added as a residual. Crucially, this gate is initialized to zero, meaning that at the start of training the cross-attention contributes nothing and the language model behaves exactly as if it were unmodified. This ensures stable training initialization and allows the model to gradually learn to incorporate visual information.
- Selective freezing: The original self-attention and feed-forward weights within each decoder layer remain frozen. Only the newly added cross-attention parameters and gating scalars are trainable.
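The zero-initialized tanh gate can be verified numerically. Since tanh(0) = 0, the gated branch is exactly suppressed at initialization, so the residual stream passes through unchanged (`TanhGate` here is a hypothetical minimal module, not OpenFlamingo's code):

```python
import torch
import torch.nn as nn

class TanhGate(nn.Module):
    """Zero-initialized tanh gating on a residual branch."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # gate scalar, init at zero

    def forward(self, residual, branch_out):
        # tanh(0) = 0, so branch_out is fully suppressed at initialization.
        return residual + torch.tanh(self.alpha) * branch_out

gate = TanhGate()
x = torch.randn(4, 16)                 # residual stream (LM hidden states)
branch = torch.randn(4, 16)            # stand-in for cross-attention output
out = gate(x, branch)
print(torch.equal(out, x))             # True: the LM is initially unmodified
```

As training progresses, gradient updates move `alpha` away from zero, smoothly blending visual information into the frozen language model.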
This design enables efficient adaptation of very large language models to multimodal tasks while preserving the original language modeling capabilities and requiring only a fraction of the total parameters to be trained.