Principle:Mlfoundations Open flamingo Text Tokenization For Vision Language
Overview
Tokenization strategy that extends standard language model tokenizers with special visual placeholder tokens to enable interleaved image-text processing.
Description
Vision-language models require a mechanism to indicate where images appear within text sequences. Standard language model tokenizers have no concept of visual inputs, so special tokens must be introduced to bridge the modality gap.
OpenFlamingo extends the base language model tokenizer by adding three special tokens to the vocabulary:
<image> -- Marks the position in the text sequence where a corresponding image should be attended to. Each <image> token in the input text must correspond to an image provided in the visual input tensor. The model uses this positional marker to align text tokens with their associated visual features.

<|endofchunk|> -- Marks the end of an image-text chunk. This delimiter is critical for multi-shot and interleaved prompts, where multiple image-text pairs are concatenated into a single sequence. It signals the boundary between one image-text example and the next.

<PAD> -- A dedicated padding token used to equalize sequence lengths within a batch. This is added because not all base language model tokenizers include a padding token by default.
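The vocabulary extension can be sketched with a toy example. Everything here (the stand-in vocabulary, the helper name `add_special_tokens`, the resulting ids) is illustrative rather than the real OpenFlamingo code; in practice the extension is done through the base tokenizer's own special-token API, followed by resizing the language model's embedding table to match.

```python
# Toy sketch: extending a base LM vocabulary with OpenFlamingo's three
# special tokens. The base vocabulary and ids are hypothetical.
base_vocab = {"a": 0, "cat": 1, "dog": 2}  # stand-in for a real LM vocabulary

SPECIAL_TOKENS = ["<image>", "<|endofchunk|>", "<PAD>"]

def add_special_tokens(vocab, specials):
    """Append special tokens to the vocabulary, assigning fresh ids at the end."""
    vocab = dict(vocab)
    for tok in specials:
        if tok not in vocab:
            vocab[tok] = len(vocab)  # next unused id
    return vocab

vocab = add_special_tokens(base_vocab, SPECIAL_TOKENS)
print(vocab["<image>"], vocab["<|endofchunk|>"], vocab["<PAD>"])  # → 3 4 5
```

Because the new ids sit past the original vocabulary size, the model's input and output embedding matrices must grow by the same number of rows before these tokens can be used.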
The text input must contain <image> tokens at positions corresponding to the input images. If two images are provided, the text must contain exactly two <image> tokens, each placed at the appropriate location where that image is referenced.
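This alignment requirement can be enforced with a simple check before tokenization. The helper below is a hypothetical validation sketch, not part of the OpenFlamingo API:

```python
def check_image_alignment(text: str, num_images: int) -> bool:
    """Verify the prompt has exactly one <image> token per provided image."""
    n = text.count("<image>")
    if n != num_images:
        raise ValueError(
            f"text contains {n} <image> tokens but {num_images} images were given"
        )
    return True

# Two images, two <image> tokens at the positions where each image is referenced.
prompt = "<image>A photo of a cat.<|endofchunk|><image>A photo of a dog."
check_image_alignment(prompt, 2)  # passes
```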
For generation tasks, left-padding is used (i.e., tokenizer.padding_side = "left") so that generated tokens appear contiguously on the right side of the sequence without padding tokens interspersed in the output.
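The effect of left-padding can be illustrated on raw token-id lists. This is a minimal sketch of the padding behavior itself, not the tokenizer's implementation; the pad id is a placeholder:

```python
PAD_ID = 0  # hypothetical id for <PAD>

def left_pad(sequences, pad_id=PAD_ID):
    """Prepend pad tokens so all sequences share the longest length."""
    max_len = max(len(s) for s in sequences)
    return [[pad_id] * (max_len - len(s)) + s for s in sequences]

padded = left_pad([[5, 6], [7, 8, 9]])
# → [[0, 5, 6], [7, 8, 9]]; real content ends flush right, so newly
# generated tokens are appended with no pad ids mixed into the output.
```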
Usage
Use this tokenization scheme when preparing text inputs that reference images for OpenFlamingo inference or training. Any text prompt intended for the model must include <image> placeholder tokens at the positions where the model should condition on the corresponding images.
Theoretical Basis
The <image> token serves as a positional marker that triggers cross-attention conditioning in the language model. When the model encounters <image> tokens during the forward pass, it conditions the subsequent text generation on the corresponding visual features via gated cross-attention layers. These layers are interleaved between the frozen language model layers and allow visual information to influence text generation at specific points in the sequence.
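The gating mechanism can be sketched in miniature. In Flamingo-style models, the cross-attention output is blended into the residual stream through a tanh gate whose parameter starts at zero, so the frozen language model is initially unchanged. The scalar toy below (function name and values are illustrative) shows only that gating behavior, not the full attention computation:

```python
import math

def gated_residual(x, attn_out, gate):
    """Residual update: x + tanh(gate) * attn_out.

    With gate = 0 (the initialization), tanh(gate) = 0 and the layer is an
    identity, leaving the frozen LM's behavior intact at the start of training.
    """
    return [xi + math.tanh(gate) * ai for xi, ai in zip(x, attn_out)]

print(gated_residual([1.0, 2.0], [0.5, 0.5], gate=0.0))  # → [1.0, 2.0]
```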
The <|endofchunk|> token delineates boundaries between image-text pairs in multi-shot prompts. This is essential for the model to understand where one example ends and another begins, enabling proper in-context learning from multiple image-text demonstrations. Without this delimiter, the model would have no way to distinguish separate examples within a concatenated prompt.
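A multi-shot prompt built with these delimiters looks as follows. The helper below is an illustrative sketch (the function name and caption strings are hypothetical), showing each demonstration terminated by <|endofchunk|> and the final <image> left open for the model to complete:

```python
def build_few_shot_prompt(demo_captions, query_prefix=""):
    """Concatenate (image, caption) demonstrations, then open the query chunk."""
    chunks = [f"<image>{cap}<|endofchunk|>" for cap in demo_captions]
    return "".join(chunks) + "<image>" + query_prefix

prompt = build_few_shot_prompt(
    ["An image of a cat.", "An image of a dog."],
    query_prefix="An image of",
)
# → "<image>An image of a cat.<|endofchunk|><image>An image of a dog.<|endofchunk|><image>An image of"
```

Three images would accompany this prompt: two demonstration images and the query image the model is asked to describe.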
Related Pages
Implementation:Mlfoundations_Open_flamingo_Tokenizer_with_special_tokens