Principle: OpenAI CLIP Text Tokenization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A text encoding mechanism that converts raw text strings into fixed-length integer token sequences using Byte Pair Encoding with special start-of-text and end-of-text delimiters.
Description
Text tokenization is the process of converting human-readable text into a sequence of integer token IDs that a neural network can process. CLIP uses a modified Byte Pair Encoding (BPE) scheme with a vocabulary of 49,408 tokens (including the two special delimiter tokens). The tokenization pipeline consists of:
- Text cleaning: Fix Unicode errors (via ftfy), unescape HTML entities, and strip whitespace.
- Lowercasing: Convert all text to lowercase for case-insensitive matching.
- Regex splitting: Split text into tokens using a regex pattern that handles contractions, Unicode letters, numbers, and special tokens.
- Byte-level encoding: Map each byte of each token to a Unicode character (avoiding whitespace/control characters that cause issues).
- BPE merging: Iteratively merge the most frequent character pairs according to a pre-computed merge table (~49K merges).
- Special token wrapping: Prepend a start-of-text token (ID 49406) and append an end-of-text token (ID 49407).
- Padding: Pad or truncate the sequence to the fixed context length of 77 tokens.
The context length of 77 tokens is a hard constraint of the CLIP architecture, determined by the positional embedding size in the text transformer.
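The special-token wrapping and padding steps above can be sketched in plain Python. This is a minimal illustration: `wrap_and_pad` is a hypothetical helper (not CLIP's actual API), and the BPE IDs in the example are invented, but the special-token IDs and context length are CLIP's real values.

```python
SOT_ID, EOT_ID = 49406, 49407  # CLIP's start-of-text / end-of-text token IDs
CONTEXT_LENGTH = 77            # fixed by the text transformer's positional embeddings

def wrap_and_pad(bpe_ids, context_length=CONTEXT_LENGTH):
    """Prepend SOT, append EOT, then zero-pad (or truncate) to a fixed length."""
    tokens = [SOT_ID] + bpe_ids + [EOT_ID]
    if len(tokens) > context_length:
        # Truncate, keeping EOT as the final token
        tokens = tokens[:context_length - 1] + [EOT_ID]
    return tokens + [0] * (context_length - len(tokens))

# Hypothetical BPE IDs standing in for "a photo of a dog"
seq = wrap_and_pad([320, 1125, 539, 320, 1929])
# seq has length 77: [49406, 320, 1125, 539, 320, 1929, 49407, 0, 0, ...]
```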
Usage
Use this principle whenever preparing text inputs for a CLIP model, whether for zero-shot classification labels (e.g., "a photo of a dog"), prompt-engineered templates, or any other text that needs to be encoded into CLIP's embedding space.
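For zero-shot classification, label strings are commonly expanded through prompt templates before tokenization. A minimal sketch of that pattern (the template strings and class names here are illustrative, not from a fixed CLIP prompt set):

```python
# Illustrative prompt templates for zero-shot classification
templates = ["a photo of a {}.", "a blurry photo of a {}."]
classnames = ["dog", "cat"]

# Each class yields one prompt per template; every prompt is then tokenized
prompts = [t.format(name) for name in classnames for t in templates]
# prompts[0] == "a photo of a dog."
```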
Theoretical Basis
Byte Pair Encoding (BPE) is a subword tokenization algorithm that balances vocabulary size against the ability to represent rare or unseen words:
```python
# BPE merge loop (simplified pseudo-code)
# 1. Start with character-level tokens; "</w>" marks the end of the word
word = tuple("hell") + ("o</w>",)  # ('h', 'e', 'l', 'l', 'o</w>')
# 2. Find the highest-priority adjacent pair according to the merge table
pairs = get_pairs(word)
best_pair = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
# 3. If that pair appears in the merge table, merge it into a single token,
#    e.g. ('l', 'l') -> 'll'
# 4. Repeat until no adjacent pair is found in the merge table
# Result: a sequence of subword tokens
```
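The merge loop can be made concrete with a toy merge table. The ranks below are invented for illustration; CLIP loads its real merge table (~49K entries) from a vocabulary file.

```python
def bpe(token, merge_ranks):
    """Greedily apply BPE merges to a single word, lowest rank first."""
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    while len(word) > 1:
        pairs = {(word[i], word[i + 1]) for i in range(len(word) - 1)}
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no remaining adjacent pair is in the merge table
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(first + second)  # merge this occurrence
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return word

# Toy merge table: lower rank = higher merge priority
ranks = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2}
# bpe("hello", ranks) -> ('hell', 'o</w>')
```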
CLIP's BPE uses a byte-level encoding that maps all 256 possible byte values to Unicode characters, ensuring every input can be tokenized without unknown tokens. This is combined with the </w> end-of-word marker convention.
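A sketch of that byte-to-Unicode mapping, following the GPT-2-style construction that CLIP's tokenizer inherits: printable bytes map to themselves, and the remaining bytes (whitespace, control characters) are shifted into unused Unicode code points.

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a distinct printable Unicode character."""
    # Bytes that are already safe printable characters map to themselves
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)        # e.g. space, newline, control bytes
            cs.append(256 + n)  # shifted to an unused code point
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
# All 256 byte values are covered; e.g. the space byte (32) becomes 'Ġ' (U+0120)
```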
The fixed context length of 77 tokens matches the number of positional embeddings in the text transformer (the first dimension of the positional embedding table):
```python
# Context length constraint (from CLIP's text transformer definition)
self.positional_embedding = nn.Parameter(
    torch.empty(context_length, transformer_width)  # context_length = 77
)
```