
Implementation:Mlfoundations Open flamingo Flamingo generate

From Leeroopedia



Overview

Concrete tool for generating text conditioned on interleaved images and text, provided by the OpenFlamingo Flamingo class.

Description

The Flamingo.generate() method encodes vision inputs through CLIP + Perceiver, conditions the language model's cross-attention layers based on <image> token positions in the text, then delegates to HuggingFace's generate() method for autoregressive decoding. After generation, conditioned layers are cleared to avoid memory leaks. Supports beam search, sampling, and other HuggingFace generation strategies via **kwargs.
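The condition-then-delegate-then-clear pattern described above can be illustrated with a minimal stand-in sketch. The class and method names below (ToyFlamingo, CrossAttnLayer) are hypothetical stubs for illustration only, not the real open_flamingo API; only the overall control flow mirrors the description.

```python
# Illustrative sketch of the condition -> delegate -> clear pattern.
# All names here are hypothetical; this is not the open_flamingo code.

class CrossAttnLayer:
    """Stub for a gated cross-attention layer that can hold vision features."""
    def __init__(self):
        self.vis_x = None  # conditioned vision features (None = cleared)

    def condition_vis_x(self, vis_x):
        self.vis_x = vis_x


class ToyFlamingo:
    def __init__(self, num_layers=2):
        self.layers = [CrossAttnLayer() for _ in range(num_layers)]

    def generate(self, vision_x, lang_x, **kwargs):
        # 1. Encode the vision input (stubbed) and condition every layer.
        vis_features = f"encoded({vision_x})"
        for layer in self.layers:
            layer.condition_vis_x(vis_features)
        # 2. Delegate to the language model's autoregressive decoding (stubbed):
        #    the output is the prompt with generated tokens appended.
        output = lang_x + ["<generated>"]
        # 3. Clear the conditioned state so vision tensors are not kept alive.
        for layer in self.layers:
            layer.condition_vis_x(None)
        return output


model = ToyFlamingo()
out = model.generate("img", ["<image>", "A"])
print(out)                                          # ['<image>', 'A', '<generated>']
print(all(l.vis_x is None for l in model.layers))   # True: state was cleared
```

The key point of the pattern is step 3: clearing the conditioned features after decoding is what prevents the memory leak mentioned above.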

Usage

After preparing the vision_x tensor and the tokenized text with <image> placeholders, call model.generate() to produce output token IDs.

Code Reference

Source
Repository: https://github.com/mlfoundations/open_flamingo, File: open_flamingo/src/flamingo.py, Lines 124-175
Signature
def generate(
    self,
    vision_x: torch.Tensor,
    lang_x: torch.Tensor,
    attention_mask: torch.Tensor = None,
    **kwargs,
) -> torch.Tensor
Import
from open_flamingo import create_model_and_transforms (the model is created via this factory; generate() is a method on the returned model)

I/O Contract

Inputs

Name Type Required Description
vision_x torch.Tensor Yes Vision input, shape (B, T_img, F, C, H, W) with F=1
lang_x torch.Tensor Yes Language input token IDs, shape (B, T_txt)
attention_mask torch.Tensor No Attention mask, shape (B, T_txt)
**kwargs dict No HuggingFace generate kwargs: max_new_tokens, num_beams, temperature, etc.
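The vision_x contract above (6 dimensions, F = 1 for still images) can be checked with a small shape helper. This is an illustrative sketch operating on plain shape tuples, not part of open_flamingo:

```python
def check_vision_x_shape(shape):
    """Validate a vision_x shape tuple against (B, T_img, F, C, H, W) with F == 1.

    Illustrative helper only; not part of the open_flamingo API.
    """
    if len(shape) != 6:
        raise ValueError(f"expected 6 dims (B, T_img, F, C, H, W), got {len(shape)}")
    b, t_img, f, c, h, w = shape
    if f != 1:
        raise ValueError(f"frame dim F must be 1 for still images, got {f}")
    return {"batch": b, "num_images": t_img, "channels": c, "height": h, "width": w}


# One sample containing three 224x224 RGB images:
info = check_vision_x_shape((1, 3, 1, 3, 224, 224))
print(info["num_images"])  # 3
```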

Outputs

Name Type Description
output_ids torch.Tensor Token IDs tensor — lang_x with generated tokens appended, shape (B, T_txt + generated_length)
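Because the output prepends the prompt tokens, the newly generated tokens are recovered by slicing each row at the prompt length. A minimal pure-Python sketch of that bookkeeping (using lists of token IDs in place of tensors; not part of open_flamingo):

```python
def split_generated(output_ids, prompt_len):
    """Split rows of shape (B, T_txt + generated_length) into prompt and generated parts.

    Illustrative only: output_ids is a list of token-ID rows standing in for a tensor.
    """
    prompt = [row[:prompt_len] for row in output_ids]
    generated = [row[prompt_len:] for row in output_ids]
    return prompt, generated


# Prompt of length 4; the model appended 3 new token IDs.
output_ids = [[5, 11, 7, 2, 42, 13, 9]]
prompt, generated = split_generated(output_ids, prompt_len=4)
print(generated)  # [[42, 13, 9]]
```

With real tensors the same slice is `output_ids[:, prompt_len:]`, as in the decoding step of the usage example below.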

Usage Examples

The following example demonstrates few-shot image captioning with 2 demo images and 1 query image:

from open_flamingo import create_model_and_transforms
from PIL import Image
import torch

# Create model and get image processor / tokenizer
model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Load demo images and query image
demo_image_1 = image_processor(Image.open("demo1.jpg")).unsqueeze(0)
demo_image_2 = image_processor(Image.open("demo2.jpg")).unsqueeze(0)
query_image  = image_processor(Image.open("query.jpg")).unsqueeze(0)

# Stack into vision_x: shape (B, T_img, F, C, H, W)
# T_img = 3 images, F = 1 frame each
vision_x = torch.stack([demo_image_1, demo_image_2, query_image], dim=1).unsqueeze(2)

# Tokenize the interleaved prompt with <image> placeholders.
# <image> and <|endofchunk|> are special tokens that
# create_model_and_transforms registers with the tokenizer.
prompt = (
    "<image>A cat sitting on a windowsill.<|endofchunk|>"
    "<image>A dog playing in the park.<|endofchunk|>"
    "<image>"
)
lang_x = tokenizer(prompt, return_tensors="pt")

# Generate caption for the query image
output_ids = model.generate(
    vision_x=vision_x,
    lang_x=lang_x["input_ids"],
    attention_mask=lang_x["attention_mask"],
    max_new_tokens=20,
    num_beams=3,
)

# Decode the generated tokens
generated_text = tokenizer.decode(output_ids[0, lang_x["input_ids"].shape[1]:])
print(generated_text)
# Example output: "A bird perched on a tree branch."
