Implementation:Mlfoundations Open flamingo Tokenizer with special tokens
Overview
Wrapper pattern using HuggingFace AutoTokenizer with OpenFlamingo-specific special tokens for interleaved image-text tokenization.
Description
This is a Wrapper Doc. The tokenizer is not a custom class but rather a standard HuggingFace tokenizer instance that has been extended with additional special tokens specific to OpenFlamingo.
The tokenizer is created via AutoTokenizer.from_pretrained() inside the create_model_and_transforms() factory function and then extended with three special tokens:
- <image> -- Visual placeholder token
- <|endofchunk|> -- Image-text chunk delimiter token
- <PAD> -- Padding token
After adding these tokens, the language model's token embeddings are resized to accommodate the new vocabulary entries. The tokenizer's padding side must be set to "left" for generation tasks, ensuring that generated tokens appear contiguously on the right.
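This extension follows the standard HuggingFace pattern: call tokenizer.add_special_tokens(), then resize the language model's embedding table to the new vocabulary size. Below is a minimal pure-Python sketch of that pattern using toy stand-in classes (not the real transformers API; in factory.py the tokenizer comes from AutoTokenizer.from_pretrained()):

```python
# Toy stand-ins illustrating the add-tokens-then-resize-embeddings
# pattern. A real tokenizer/model would come from transformers.

class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def add_special_tokens(self, special_tokens_dict):
        # Mimics tokenizer.add_special_tokens(): returns the number of
        # tokens actually added to the vocabulary.
        added = 0
        for tok in special_tokens_dict.get("additional_special_tokens", []):
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)
                added += 1
        pad = special_tokens_dict.get("pad_token")
        if pad is not None and pad not in self.vocab:
            self.vocab[pad] = len(self.vocab)
            added += 1
        return added

    def __len__(self):
        return len(self.vocab)


class ToyEmbedding:
    # Stand-in for the language model's token-embedding table.
    def __init__(self, num_embeddings):
        self.num_embeddings = num_embeddings

    def resize(self, new_size):
        # Mimics model.resize_token_embeddings(len(tokenizer))
        self.num_embeddings = new_size


tokenizer = ToyTokenizer(["hello", "world"])
embeddings = ToyEmbedding(len(tokenizer))

# The three OpenFlamingo-specific tokens from the description above
added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|endofchunk|>", "<image>"],
     "pad_token": "<PAD>"}
)
embeddings.resize(len(tokenizer))

print(added)                      # 3 new tokens
print(embeddings.num_embeddings)  # 5: original 2 + 3 special tokens
```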
The tokenizer is used as a standard HuggingFace tokenizer, typically invoked with return_tensors="pt" and padding="longest" to produce batched PyTorch tensors.
Usage
After obtaining the tokenizer from create_model_and_transforms(), tokenize text that contains <image> placeholders at positions corresponding to input images.
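A small, hypothetical helper (not part of OpenFlamingo) can make this interleaving explicit: one <image> placeholder per input image, each completed in-context example terminated with <|endofchunk|>, and a trailing open-ended query:

```python
# Hypothetical prompt builder for illustration only. The function name
# and signature are not part of the OpenFlamingo API.

def build_prompt(captions, query_prefix="An image of"):
    """captions: texts for the in-context example images, in order."""
    chunks = [f"<image>{c}<|endofchunk|>" for c in captions]
    chunks.append(f"<image>{query_prefix}")  # query image, left open
    return "".join(chunks)


prompt = build_prompt([
    "An image of two cats sleeping on a couch.",
    "An image of a dog playing fetch in the park.",
])
# Three <image> placeholders, so three images must be supplied
print(prompt)
```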
Code Reference
Source: Repository https://github.com/mlfoundations/open_flamingo, File: open_flamingo/src/factory.py, Lines 50-63
Signature (wrapper usage):
# The tokenizer is a standard HuggingFace tokenizer with added special tokens
tokenizer(
    text: Union[str, List[str]],
    return_tensors: str = "pt",
    padding: str = "longest",
    truncation: bool = True,
    max_length: int = 2000,
) -> BatchEncoding  # Contains input_ids and attention_mask
Import: The tokenizer is returned by create_model_and_transforms(), not imported separately. Internally it uses from transformers import AutoTokenizer.
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | Union[str, List[str]] | Yes | Text with <image> placeholders marking image positions |
| return_tensors | str | No | Return format (default "pt") |
| padding | str | No | Padding strategy (default "longest") |
| max_length | int | No | Maximum sequence length |
Outputs
| Name | Type | Description |
|---|---|---|
| input_ids | torch.Tensor | Token IDs with shape (B, T_txt) |
| attention_mask | torch.Tensor | Attention mask with shape (B, T_txt) |
Usage Examples
Below is an example of tokenizing a few-shot prompt with <image> tokens and <|endofchunk|> delimiters:
from open_flamingo import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="anas-awadalla/mpt-1b-redpajama-200b",
    tokenizer_path="anas-awadalla/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,
)

# Set padding side to left for generation
tokenizer.padding_side = "left"

# Few-shot prompt with two image-text examples and a query image
prompt = (
    "<image>An image of two cats sleeping on a couch.<|endofchunk|>"
    "<image>An image of a dog playing fetch in the park.<|endofchunk|>"
    "<image>An image of"
)

# Tokenize the prompt
lang_x = tokenizer(
    [prompt],
    return_tensors="pt",
    padding="longest",
    truncation=True,
    max_length=2000,
)

# lang_x contains:
# lang_x["input_ids"] -> torch.Tensor of shape (1, T_txt)
# lang_x["attention_mask"] -> torch.Tensor of shape (1, T_txt)
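To see why padding_side = "left" matters for batched generation, consider the toy padder below (a stand-in for the tokenizer's padding logic, not the HuggingFace implementation). With left padding, every prompt ends at the same final position, so tokens appended during generation follow the prompt contiguously; with right padding, pad tokens would sit exactly where generation begins for shorter sequences:

```python
# Toy illustration of left vs right padding for a batch of token-ID
# sequences. PAD_ID stands in for the <PAD> token's ID.

PAD_ID = 0

def pad_batch(seqs, side="left"):
    longest = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        pad = [PAD_ID] * (longest - len(s))
        out.append(pad + s if side == "left" else s + pad)
    return out

batch = [[5, 6, 7], [8, 9]]

left = pad_batch(batch, side="left")
right = pad_batch(batch, side="right")

print(left)   # [[5, 6, 7], [0, 8, 9]] -- every prompt ends flush right
print(right)  # [[5, 6, 7], [8, 9, 0]] -- pad sits where generation starts
```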
Related Pages
Principle:Mlfoundations_Open_flamingo_Text_Tokenization_For_Vision_Language