Implementation:Mlfoundations Open flamingo Tokenizer with special tokens
Overview
Wrapper pattern using HuggingFace AutoTokenizer with OpenFlamingo-specific special tokens for interleaved image-text tokenization.
Description
This is a Wrapper Doc. The tokenizer is not a custom class but rather a standard HuggingFace tokenizer instance that has been extended with additional special tokens specific to OpenFlamingo.
The tokenizer is created via AutoTokenizer.from_pretrained() inside the create_model_and_transforms() factory function and then extended with three special tokens:
- <image> -- Visual placeholder token
- <|endofchunk|> -- Image-text chunk delimiter token
- <PAD> -- Padding token
After adding these tokens, the language model's token embeddings are resized to accommodate the new vocabulary entries. The tokenizer's padding side must be set to "left" for generation tasks, ensuring that generated tokens appear contiguously on the right.
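This extension follows the standard HuggingFace pattern: call tokenizer.add_special_tokens(), then resize the language model's embedding table to the new vocabulary size. Below is a minimal pure-Python sketch of that pattern using toy stand-in classes (not the real transformers API; in factory.py the tokenizer comes from AutoTokenizer.from_pretrained()):

```python
# Toy stand-ins illustrating the add-tokens-then-resize-embeddings
# pattern. A real tokenizer/model would come from transformers.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def add_special_tokens(self, special_tokens_dict):
        # Mimics tokenizer.add_special_tokens(): returns the number of
        # tokens actually added to the vocabulary.
        added = 0
        for tok in special_tokens_dict.get("additional_special_tokens", []):
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
        pad = special_tokens_dict.get("pad_token")
        if pad is not None and pad not in self.vocab:
            self.vocab[pad] = len(self.vocab)
            added += 1
        return added

    def __len__(self):
        return len(self.vocab)


class ToyEmbedding:
    # Stand-in for the language model's token-embedding table.
    def __init__(self, num_embeddings):
        self.num_embeddings = num_embeddings

    def resize(self, new_size):
        # Mimics model.resize_token_embeddings(len(tokenizer))
        self.num_embeddings = new_size


tokenizer = ToyTokenizer(["hello", "world"])
embeddings = ToyEmbedding(len(tokenizer))

# The three OpenFlamingo-specific tokens from the description above
added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|endofchunk|>", "<image>"],
     "pad_token": "<PAD>"}
)
embeddings.resize(len(tokenizer))

print(added)                      # 3 new tokens
print(embeddings.num_embeddings)  # 5: original 2 + 3 special tokens
```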
The tokenizer is used as a standard HuggingFace tokenizer, typically invoked with return_tensors="pt" and padding="longest" to produce batched PyTorch tensors.
Usage
After obtaining the tokenizer from create_model_and_transforms(), tokenize text that contains <image> placeholders at positions corresponding to input images.
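A small, hypothetical helper (not part of OpenFlamingo) can make this interleaving explicit: one <image> placeholder per input image, each completed in-context example terminated with <|endofchunk|>, and a trailing open-ended query:

```python
# Hypothetical prompt builder for illustration only. The function name
# and signature are not part of the OpenFlamingo API.

def build_prompt(captions, query_prefix="An image of"):
    """captions: texts for the in-context example images, in order."""
    chunks = [f"<image>{c}<|endofchunk|>" for c in captions]
    chunks.append(f"<image>{query_prefix}")  # query image, left open
    return "".join(chunks)


prompt = build_prompt([
    "An image of two cats sleeping on a couch.",
    "An image of a dog playing fetch in the park.",
])
# Three <image> placeholders, so three images must be supplied
print(prompt)
```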
Code Reference
Source: Repository https://github.com/mlfoundations/open_flamingo, File: open_flamingo/src/factory.py, Lines 50-63
Signature (wrapper usage):
# The tokenizer is a standard HuggingFace tokenizer with added special tokens
tokenizer(
    text: Union[str, List[str]],
    return_tensors: str = "pt",
    padding: str = "longest",
    truncation: bool = True,
    max_length: int = 2000,
) -> BatchEncoding  # Contains input_ids and attention_mask
Import: The tokenizer is returned by create_model_and_transforms(), not imported separately. Internally it uses from transformers import AutoTokenizer.
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | Union[str, List[str]] | Yes | Text with <image> placeholders marking image positions |
| return_tensors | str | No | Return format (default "pt") |
| padding | str | No | Padding strategy (default "longest") |
| max_length | int | No | Maximum sequence length |
Outputs
| Name | Type | Description |
|---|---|---|
| input_ids | torch.Tensor | Token IDs with shape (B, T_txt) |
| attention_mask | torch.Tensor | Attention mask with shape (B, T_txt) |
Usage Examples
Below is an example of tokenizing a few-shot prompt with <image> tokens and <|endofchunk|> delimiters:
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Set padding side to left for generation
tokenizer.padding_side = "left"

# Few-shot prompt with two image-text examples and a query image
prompt = (
    "<image>An image of two cats sleeping on a couch.<|endofchunk|>"
    "<image>An image of a dog playing fetch in the park.<|endofchunk|>"
    "<image>An image of"
)

# Tokenize the prompt
lang_x = tokenizer(
    [prompt],
    return_tensors="pt",
    padding="longest",
    truncation=True,
    max_length=2000,
)

# lang_x contains:
# lang_x["input_ids"] -> torch.Tensor of shape (1, T_txt)
# lang_x["attention_mask"] -> torch.Tensor of shape (1, T_txt)
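To see why padding_side = "left" matters for batched generation, consider the toy padder below (a stand-in for the tokenizer's padding logic, not the HuggingFace implementation). With left padding, every prompt ends at the same final position, so tokens appended during generation follow the prompt contiguously; with right padding, pad tokens would sit exactly where generation begins for shorter sequences:

```python
# Toy illustration of left vs right padding for a batch of token-ID
# sequences. PAD_ID stands in for the <PAD> token's ID.

PAD_ID = 0

def pad_batch(seqs, side="left"):
    longest = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        pad = [PAD_ID] * (longest - len(s))
        out.append(pad + s if side == "left" else s + pad)
    return out

batch = [[5, 6, 7], [8, 9]]

left = pad_batch(batch, side="left")
right = pad_batch(batch, side="right")

print(left)   # [[5, 6, 7], [0, 8, 9]] -- every prompt ends flush right
print(right)  # [[5, 6, 7], [8, 9, 0]] -- pad sits where generation starts
```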
Related Pages
Principle:Mlfoundations_Open_flamingo_Text_Tokenization_For_Vision_Language