
Implementation:Mlfoundations Open flamingo Tokenizer with special tokens

From Leeroopedia



Overview

Wrapper pattern using HuggingFace AutoTokenizer with OpenFlamingo-specific special tokens for interleaved image-text tokenization.

Description

This is a Wrapper Doc. The tokenizer is not a custom class but rather a standard HuggingFace tokenizer instance that has been extended with additional special tokens specific to OpenFlamingo.

The tokenizer is created via AutoTokenizer.from_pretrained() inside the create_model_and_transforms() factory function and then extended with three special tokens:

  • <image> -- Visual placeholder token
  • <|endofchunk|> -- Image-text chunk delimiter token
  • <PAD> -- Padding token

After adding these tokens, the language model's token embeddings are resized to accommodate the new vocabulary entries. The tokenizer's padding side must be set to "left" for generation tasks, ensuring that generated tokens appear contiguously on the right.
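The effect of left padding on batched generation can be illustrated with a minimal plain-Python sketch (no transformers dependency; the token IDs and helper below are hypothetical, not part of the OpenFlamingo API):

```python
# Illustrative sketch: why padding_side = "left" matters for generation.
# PAD and the token IDs are arbitrary example values.
PAD = 0

def pad_batch(sequences, side="left"):
    """Pad variable-length ID sequences to a common length on one side."""
    max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        pad = [PAD] * (max_len - len(s))
        padded.append(pad + s if side == "left" else s + pad)
    return padded

batch = [[5, 6], [7, 8, 9]]

# Left padding: every sequence ends at the same position, so newly
# generated tokens can be appended contiguously on the right.
assert pad_batch(batch, side="left") == [[0, 5, 6], [7, 8, 9]]

# Right padding would leave pad tokens between a shorter prompt and
# its generated continuation.
assert pad_batch(batch, side="right") == [[5, 6, 0], [7, 8, 9]]
```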

The tokenizer is used as a standard HuggingFace tokenizer, typically invoked with return_tensors="pt" and padding="longest" to produce batched PyTorch tensors.

Usage

After obtaining the tokenizer from create_model_and_transforms(), tokenize text that contains <image> placeholders at positions corresponding to input images.

Code Reference

Source: repository https://github.com/mlfoundations/open_flamingo, file open_flamingo/src/factory.py, lines 50-63

Signature (wrapper usage):

# The tokenizer is a standard HuggingFace tokenizer with added special tokens
tokenizer(
    text: Union[str, List[str]],
    return_tensors: str = "pt",
    padding: str = "longest",
    truncation: bool = True,
    max_length: int = 2000,
) -> BatchEncoding  # Contains input_ids and attention_mask

Import: The tokenizer is returned by create_model_and_transforms(), not imported separately. Internally it uses from transformers import AutoTokenizer.
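The wrapper pattern the factory applies can be sketched with toy stand-ins (these classes are illustrative only, not the real transformers API; the method names mirror HuggingFace's add_special_tokens and resize_token_embeddings):

```python
# Minimal stand-in for the wrapper pattern: extend a tokenizer's
# vocabulary with special tokens, then resize the embedding table
# so the language model has a row for each new entry.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = dict(vocab)

    def add_special_tokens(self, tokens):
        """Append new tokens to the vocabulary; return how many were added."""
        added = 0
        for tok in tokens:
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
        return added

    def __len__(self):
        return len(self.vocab)

class ToyLanguageModel:
    def __init__(self, vocab_size, dim=4):
        # One embedding row (a list of floats) per vocabulary entry.
        self.embeddings = [[0.0] * dim for _ in range(vocab_size)]

    def resize_token_embeddings(self, new_size):
        dim = len(self.embeddings[0])
        while len(self.embeddings) < new_size:
            self.embeddings.append([0.0] * dim)

tokenizer = ToyTokenizer({"an": 0, "image": 1, "of": 2})
model = ToyLanguageModel(vocab_size=len(tokenizer))

# Mirror of the factory's extension step.
tokenizer.add_special_tokens(["<image>", "<|endofchunk|>", "<PAD>"])
model.resize_token_embeddings(len(tokenizer))

assert len(tokenizer) == 6
assert len(model.embeddings) == 6
```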

I/O Contract

Inputs

Name            Type                   Required  Description
text            Union[str, List[str]]  Yes       Text with <image> placeholders marking image positions
return_tensors  str                    No        Return tensor format (default "pt")
padding         str                    No        Padding strategy (default "longest")
truncation      bool                   No        Whether to truncate sequences longer than max_length
max_length      int                    No        Maximum sequence length in tokens

Outputs

Name            Type          Description
input_ids       torch.Tensor  Token IDs with shape (B, T_txt)
attention_mask  torch.Tensor  Attention mask with shape (B, T_txt)

Usage Examples

Below is an example of tokenizing a few-shot prompt with <image> tokens and <|endofchunk|> delimiters:

from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Set padding side to left for generation
tokenizer.padding_side = "left"

# Few-shot prompt with two image-text examples and a query image
prompt = (
    "<image>An image of two cats sleeping on a couch.<|endofchunk|>"
    "<image>An image of a dog playing fetch in the park.<|endofchunk|>"
    "<image>An image of"
)

# Tokenize the prompt
lang_x = tokenizer(
    [prompt],
    return_tensors="pt",
    padding="longest",
    truncation=True,
    max_length=2000,
)

# lang_x contains:
#   lang_x["input_ids"]      -> torch.Tensor of shape (1, T_txt)
#   lang_x["attention_mask"] -> torch.Tensor of shape (1, T_txt)

Related Pages

Principle:Mlfoundations_Open_flamingo_Text_Tokenization_For_Vision_Language
