Principle: OpenAI CLIP Text Tokenization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A text encoding mechanism that converts raw text strings into fixed-length integer token sequences using Byte Pair Encoding with special start-of-text and end-of-text delimiters.
Description
Text tokenization is the process of converting human-readable text into a sequence of integer token IDs that a neural network can process. CLIP uses a modified Byte Pair Encoding (BPE) scheme with a vocabulary of 49,408 tokens (including the two special delimiter tokens). The tokenization pipeline consists of:
- Text cleaning: Fix Unicode errors (via ftfy), unescape HTML entities, and strip whitespace.
- Lowercasing: Convert all text to lowercase for case-insensitive matching.
- Regex splitting: Split text into tokens using a regex pattern that handles contractions, Unicode letters, numbers, and special tokens.
- Byte-level encoding: Map each byte of each token to a Unicode character (avoiding whitespace/control characters that cause issues).
- BPE merging: Iteratively merge the most frequent character pairs according to a pre-computed merge table (~49K merges).
- Special token wrapping: Prepend a start-of-text token (ID 49406) and append an end-of-text token (ID 49407).
- Padding: Pad or truncate the sequence to the fixed context length of 77 tokens.
The context length of 77 tokens is a hard constraint of the CLIP architecture, determined by the positional embedding size in the text transformer.
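The special-token wrapping and padding steps above can be sketched in plain Python. This is a minimal illustration: `wrap_and_pad` is a hypothetical helper (not CLIP's actual API), and the BPE IDs in the example are invented, but the special-token IDs and context length are CLIP's real values.

```python
SOT_ID, EOT_ID = 49406, 49407  # CLIP's start-of-text / end-of-text token IDs
CONTEXT_LENGTH = 77            # fixed by the text transformer's positional embeddings

def wrap_and_pad(bpe_ids, context_length=CONTEXT_LENGTH):
    """Prepend SOT, append EOT, then zero-pad (or truncate) to a fixed length."""
    tokens = [SOT_ID] + bpe_ids + [EOT_ID]
    if len(tokens) > context_length:
        # Truncate, keeping EOT as the final token
        tokens = tokens[:context_length - 1] + [EOT_ID]
    return tokens + [0] * (context_length - len(tokens))

# Hypothetical BPE IDs standing in for "a photo of a dog"
seq = wrap_and_pad([320, 1125, 539, 320, 1929])
# seq has length 77: [49406, 320, 1125, 539, 320, 1929, 49407, 0, 0, ...]
```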
Usage
Use this principle whenever preparing text inputs for a CLIP model, whether for zero-shot classification labels (e.g., "a photo of a dog"), prompt-engineered templates, or any other text that needs to be encoded into CLIP's embedding space.
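For zero-shot classification, label strings are commonly expanded through prompt templates before tokenization. A minimal sketch of that pattern (the template strings and class names here are illustrative, not from a fixed CLIP prompt set):

```python
# Illustrative prompt templates for zero-shot classification
templates = ["a photo of a {}.", "a blurry photo of a {}."]
classnames = ["dog", "cat"]

# Each class yields one prompt per template; every prompt is then tokenized
prompts = [t.format(name) for name in classnames for t in templates]
# prompts[0] == "a photo of a dog."
```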
Theoretical Basis
Byte Pair Encoding (BPE) is a subword tokenization algorithm that balances vocabulary size against the ability to represent rare or unseen words:
```python
# BPE merge loop (simplified pseudo-code)
# 1. Start with character-level tokens; "</w>" marks the end of the word
word = tuple("hell") + ("o</w>",)  # ('h', 'e', 'l', 'l', 'o</w>')
# 2. Find the highest-priority adjacent pair according to the merge table
pairs = get_pairs(word)
best_pair = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
# 3. If that pair appears in the merge table, merge it into a single token,
#    e.g. ('l', 'l') -> 'll'
# 4. Repeat until no adjacent pair is found in the merge table
# Result: a sequence of subword tokens
```
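The merge loop can be made concrete with a toy merge table. The ranks below are invented for illustration; CLIP loads its real merge table (~49K entries) from a vocabulary file.

```python
def bpe(token, merge_ranks):
    """Greedily apply BPE merges to a single word, lowest rank first."""
    word = tuple(token[:-1]) + (token[-1] + "</w>",)
    while len(word) > 1:
        pairs = {(word[i], word[i + 1]) for i in range(len(word) - 1)}
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no remaining adjacent pair is in the merge table
        first, second = best
        merged, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                merged.append(first + second)  # merge this occurrence
                i += 2
            else:
                merged.append(word[i])
                i += 1
        word = tuple(merged)
    return word

# Toy merge table: lower rank = higher merge priority
ranks = {("l", "l"): 0, ("h", "e"): 1, ("he", "ll"): 2}
# bpe("hello", ranks) -> ('hell', 'o</w>')
```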
CLIP's BPE uses a byte-level encoding that maps all 256 possible byte values to Unicode characters, ensuring every input can be tokenized without unknown tokens. This is combined with the </w> end-of-word marker convention.
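A sketch of that byte-to-Unicode mapping, following the GPT-2-style construction that CLIP's tokenizer inherits: printable bytes map to themselves, and the remaining bytes (whitespace, control characters) are shifted into unused Unicode code points.

```python
def bytes_to_unicode():
    """Map every byte value 0-255 to a distinct printable Unicode character."""
    # Bytes that are already safe printable characters map to themselves
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)        # e.g. space, newline, control bytes
            cs.append(256 + n)  # shifted to an unused code point
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_encoder = bytes_to_unicode()
# All 256 byte values are covered; e.g. the space byte (32) becomes 'Ġ' (U+0120)
```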
The fixed context length of 77 tokens matches the number of positional embeddings in the text transformer (the first dimension of the positional embedding table):
```python
# Context length constraint (from CLIP's text transformer definition)
self.positional_embedding = nn.Parameter(
    torch.empty(context_length, transformer_width)  # context_length = 77
)
```