Principle: mlfoundations/open_flamingo Vision-Conditioned Text Generation
Overview
Autoregressive text generation conditioned on visual inputs through cross-attention between encoded image features and language model hidden states.
Description
The generation process in Flamingo-style models works by:
- Encoding images through the vision encoder and Perceiver resampler to produce compact visual tokens.
- Conditioning the language model's cross-attention layers on these visual tokens, routed according to <image> token positions in the text.
- Using standard autoregressive decoding (beam search, sampling) from the conditioned language model.
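The three steps above can be sketched end to end with toy shapes. This is a minimal illustration, not the open_flamingo API: the function names, dimensions, and the mean-pool stand-in for the Perceiver resampler are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# 1) Encode: stand-in for vision encoder + Perceiver resampler.
#    Compresses many patch features into a few compact "visual tokens".
def resample(patch_feats, num_latents=4):
    # crude mean-pool over chunks; the real resampler uses learned
    # latent queries attending over the patch features
    chunks = np.array_split(patch_feats, num_latents, axis=0)
    return np.stack([c.mean(axis=0) for c in chunks])  # (num_latents, d)

# 2) Condition: single-head cross-attention from text states to visual tokens.
def cross_attend(text_states, visual_tokens):
    scores = text_states @ visual_tokens.T / np.sqrt(text_states.shape[-1])
    return softmax(scores) @ visual_tokens  # (seq, d)

# 3) Decode: greedy autoregressive decoding from a toy LM head.
def greedy_decode(visual_tokens, embed, lm_head, steps=5, bos=0):
    tokens = [bos]
    for _ in range(steps):
        x = embed[np.array(tokens)]              # (t, d) token embeddings
        x = x + cross_attend(x, visual_tokens)   # inject visual conditioning
        logits = x[-1] @ lm_head                 # next-token logits
        tokens.append(int(np.argmax(logits)))    # greedy pick
    return tokens

d, vocab = 8, 16
patch_feats = rng.normal(size=(49, d))           # e.g. 7x7 grid of ViT patches
visual_tokens = resample(patch_feats)
embed = rng.normal(size=(vocab, d))
lm_head = rng.normal(size=(d, vocab))
out = greedy_decode(visual_tokens, embed, lm_head)
print(out)
```

Beam search or sampling would replace only the argmax in step 3; the encoding and conditioning stages are unchanged across decoding strategies.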
The key insight is that visual features are injected via gated cross-attention layers interleaved within the frozen LM decoder, allowing the model to generate text that references visual content.
Usage
When generating captions, answers, or descriptions from images using a vision-language model.
Theoretical Basis
The generation follows standard autoregressive decoding:
P(y_t | y_{<t}, x_img)
where x_img are the visual features. The visual conditioning happens through gated cross-attention:
output = LM_layer(x) + tanh(alpha) * CrossAttn(x, vision_features)
where alpha is initialized to 0 so the model starts as the original LM and gradually learns to attend to visual features.
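A minimal numpy sketch of the gating equation above, under simplifying assumptions: alpha is a single scalar (real models learn per-layer gates), the attention is single-head without learned projections, and the frozen LM_layer term is reduced to the identity so only the gated residual is shown.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attn(x, vision_features):
    # single-head attention: text states query the visual tokens
    scores = x @ vision_features.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ vision_features

def gated_layer(x, vision_features, alpha):
    # output = x + tanh(alpha) * CrossAttn(x, vision_features)
    # (LM_layer is taken as the identity for this sketch)
    return x + np.tanh(alpha) * cross_attn(x, vision_features)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))        # 5 text positions, hidden dim 8
vision = rng.normal(size=(4, 8))   # 4 visual tokens

# alpha = 0 at init: tanh(0) = 0, so the layer passes x through unchanged
assert np.allclose(gated_layer(x, vision, alpha=0.0), x)
# once alpha moves away from 0, visual information is mixed in
assert not np.allclose(gated_layer(x, vision, alpha=1.0), x)
```

The zero-initialized gate is what makes training stable: at step 0 the model is exactly the pretrained LM, and the gates open gradually as alpha is learned.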
The generate() method delegates to HuggingFace's generate infrastructure after setting up visual conditioning.
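The delegation pattern can be illustrated with a toy wrapper. The class and attribute names here are hypothetical stand-ins, not the actual open_flamingo or HuggingFace code; the point is only the shape of the control flow: stash the visual conditioning where the cross-attention layers can read it, then hand decoding off to the underlying generate().

```python
class ToyLM:
    """Stand-in for a HuggingFace-style language model with generate()."""
    def __init__(self):
        self.vision_context = None  # read by cross-attention during decoding

    def generate(self, prompt_tokens, max_new_tokens=3):
        # pretend decoding: append dummy tokens whose value depends on
        # whether visual conditioning is present; a real model would run
        # beam search / sampling here, reading self.vision_context each step
        fill = 99 if self.vision_context is not None else 0
        return prompt_tokens + [fill] * max_new_tokens

class ToyFlamingo:
    def __init__(self, lm):
        self.lm = lm

    def generate(self, vision_x, lang_x, **kwargs):
        # 1) cache visual tokens where the cross-attention layers can see them
        self.lm.vision_context = vision_x
        try:
            # 2) delegate all decoding logic to the wrapped LM's generate()
            return self.lm.generate(lang_x, **kwargs)
        finally:
            # 3) clear conditioning so later text-only calls are unaffected
            self.lm.vision_context = None

model = ToyFlamingo(ToyLM())
out = model.generate(vision_x=[1.0, 2.0], lang_x=[5, 6], max_new_tokens=2)
print(out)  # [5, 6, 99, 99]
```

Because decoding is delegated, every strategy the underlying generate() supports (beam search, nucleus sampling, repetition penalties) works with visual conditioning for free.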