Principle: mlfoundations/open_flamingo Vision-Conditioned Text Generation
Overview
Autoregressive text generation conditioned on visual inputs through cross-attention between encoded image features and language model hidden states.
Description
The generation process in Flamingo-style models works by:
- Encoding images through the vision encoder and Perceiver resampler to produce compact visual tokens.
- Conditioning the language model's cross-attention layers on these visual tokens, routed according to <image> token positions in the text.
- Using standard autoregressive decoding (beam search, sampling) from the conditioned language model.
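The three steps above can be sketched end to end with toy shapes. This is a minimal illustration, not the open_flamingo API: the function names, dimensions, and the mean-pool stand-in for the Perceiver resampler are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 1) Encode: stand-in for vision encoder + Perceiver resampler.
#    Compresses many patch features into a few compact "visual tokens".
def resample(patch_feats, num_latents=4):
    # crude mean-pool over chunks; the real resampler uses learned
    # latent queries attending over the patch features
    chunks = np.array_split(patch_feats, num_latents, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])  # (num_latents, d)

# 2) Condition: single-head cross-attention from text states to visual tokens.
def cross_attend(text_states, visual_tokens):
    scores = text_states @ visual_tokens.T / np.sqrt(text_states.shape[-1])
    return softmax(scores) @ visual_tokens  # (seq, d)

# 3) Decode: greedy autoregressive decoding from a toy LM head.
def greedy_decode(visual_tokens, embed, lm_head, steps=5, bos=0):
    tokens = [bos]
    for _ in range(steps):
        x = embed[np.array(tokens)]              # (t, d) token embeddings
        x = x + cross_attend(x, visual_tokens)   # inject visual conditioning
        logits = x[-1] @ lm_head                 # next-token logits
        tokens.append(int(np.argmax(logits)))    # greedy pick
    return tokens

d, vocab = 8, 16
patch_feats = rng.normal(size=(49, d))           # e.g. 7x7 grid of ViT patches
visual_tokens = resample(patch_feats)
embed = rng.normal(size=(vocab, d))
lm_head = rng.normal(size=(d, vocab))
out = greedy_decode(visual_tokens, embed, lm_head)
print(out)
```

Beam search or sampling would replace only the argmax in step 3; the encoding and conditioning stages are unchanged across decoding strategies.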
The key insight is that visual features are injected via gated cross-attention layers interleaved within the frozen LM decoder, allowing the model to generate text that references visual content.
Usage
When generating captions, answers, or descriptions from images using a vision-language model.
Theoretical Basis
The generation follows standard autoregressive decoding:
P(y_t | y_{<t}, x_img)
where x_img are the visual features. The visual conditioning happens through gated cross-attention:
output = LM_layer(x) + tanh(alpha) * CrossAttn(x, vision_features)
where alpha is initialized to 0 so the model starts as the original LM and gradually learns to attend to visual features.
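A minimal numpy sketch of the gating equation above, under simplifying assumptions: alpha is a single scalar (real models learn per-layer gates), the attention is single-head without learned projections, and the frozen LM_layer term is reduced to the identity so only the gated residual is shown.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(x, vision_features):
    # single-head attention: text states query the visual tokens
    scores = x @ vision_features.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ vision_features

def gated_layer(x, vision_features, alpha):
    # output = x + tanh(alpha) * CrossAttn(x, vision_features)
    # (LM_layer is taken as the identity for this sketch)
    return x + np.tanh(alpha) * cross_attn(x, vision_features)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 text positions, hidden dim 8
vision = rng.normal(size=(4, 8))   # 4 visual tokens

# alpha = 0 at init: tanh(0) = 0, so the layer passes x through unchanged
assert np.allclose(gated_layer(x, vision, alpha=0.0), x)
# once alpha moves away from 0, visual information is mixed in
assert not np.allclose(gated_layer(x, vision, alpha=1.0), x)
```

The zero-initialized gate is what makes training stable: at step 0 the model is exactly the pretrained LM, and the gates open gradually as alpha is learned.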
The generate() method delegates to HuggingFace's generate infrastructure after setting up visual conditioning.
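The delegation pattern can be illustrated with a toy wrapper. The class and attribute names here are hypothetical stand-ins, not the actual open_flamingo or HuggingFace code; the point is only the shape of the control flow: stash the visual conditioning where the cross-attention layers can read it, then hand decoding off to the underlying generate().

```python
class ToyLM:
    """Stand-in for a HuggingFace-style language model with generate()."""
    def __init__(self):
        self.vision_context = None  # read by cross-attention during decoding

    def generate(self, prompt_tokens, max_new_tokens=3):
        # pretend decoding: append dummy tokens whose value depends on
        # whether visual conditioning is present; a real model would run
        # beam search / sampling here, reading self.vision_context each step
        fill = 99 if self.vision_context is not None else 0
        return prompt_tokens + [fill] * max_new_tokens

class ToyFlamingo:
    def __init__(self, lm):
        self.lm = lm

    def generate(self, vision_x, lang_x, **kwargs):
        # 1) cache visual tokens where the cross-attention layers can see them
        self.lm.vision_context = vision_x
        try:
            # 2) delegate all decoding logic to the wrapped LM's generate()
            return self.lm.generate(lang_x, **kwargs)
        finally:
            # 3) clear conditioning so later text-only calls are unaffected
            self.lm.vision_context = None

model = ToyFlamingo(ToyLM())
out = model.generate(vision_x=[1.0, 2.0], lang_x=[5, 6], max_new_tokens=2)
print(out)  # [5, 6, 99, 99]
```

Because decoding is delegated, every strategy the underlying generate() supports (beam search, nucleus sampling, repetition penalties) works with visual conditioning for free.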