Implementation: OpenAI CLIP clip.tokenize
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A concrete tool, provided by the OpenAI CLIP library, for converting text strings into padded integer token tensors.
Description
The clip.tokenize() function takes one or more text strings and returns a 2D integer tensor of shape [N, 77], where N is the number of input strings and 77 is CLIP's fixed context length. Internally it uses a SimpleTokenizer (byte-pair encoding with a ~49K-entry vocabulary loaded from a bundled, compressed merges file) to encode each text, wraps each encoded sequence with the start-of-text (49406) and end-of-text (49407) special tokens, and zero-pads to the context length.
The function raises a RuntimeError if any input's encoding exceeds the context length and truncate is False (the default). When truncate is True, it clips the sequence to the context length and ensures the last token is the end-of-text marker.
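The wrap-and-pad contract described above can be sketched in plain Python. This is a minimal sketch, not the library's implementation: wrap_and_pad and the raw BPE ids are hypothetical, but the special-token values, truncation rule, and zero-padding behavior follow the description.

```python
SOT, EOT = 49406, 49407   # CLIP's start-of-text / end-of-text token ids
CONTEXT_LENGTH = 77       # fixed context length used by all CLIP models

def wrap_and_pad(bpe_ids, context_length=CONTEXT_LENGTH, truncate=False):
    """Wrap raw BPE ids with SOT/EOT and zero-pad to context_length.

    Sketch of the contract only; bpe_ids would come from CLIP's
    SimpleTokenizer in the real library.
    """
    tokens = [SOT] + list(bpe_ids) + [EOT]
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(
                f"Input is too long for context length {context_length}")
        tokens = tokens[:context_length]
        tokens[-1] = EOT  # guarantee the sequence still ends with EOT
    return tokens + [0] * (context_length - len(tokens))
```

Calling wrap_and_pad on a short id list yields a length-77 row starting with 49406, ending the content span with 49407, and padded with zeros, matching one row of the tensor clip.tokenize returns.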
Usage
Use this function to prepare text descriptions for CLIP's text encoder. Common inputs include zero-shot classification labels (e.g., "a photo of a cat"), prompt-engineered templates, or any text to be compared against images in CLIP's embedding space.
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/clip.py
- Lines: L205-245
Signature
def tokenize(
texts: Union[str, List[str]],
context_length: int = 77,
truncate: bool = False
) -> Union[torch.IntTensor, torch.LongTensor]:
"""Returns the tokenized representation of given input string(s).
Parameters
----------
texts : Union[str, List[str]]
An input string or a list of input strings to tokenize.
context_length : int
The context length to use; all CLIP models use 77 as the
context length.
truncate : bool
Whether to truncate the text in case its encoding is longer
than the context length. Default: False (raises RuntimeError).
Returns
-------
torch.Tensor
A 2D tensor of shape [N, context_length] containing token IDs.
Returns LongTensor for torch < 1.8.0, IntTensor otherwise.
"""
Import
import clip
# or
from clip import tokenize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| texts | Union[str, List[str]] | Yes | One or more text strings to tokenize (e.g. "a photo of a dog") |
| context_length | int | No | Token sequence length. Default: 77 (must match the CLIP model) |
| truncate | bool | No | Whether to silently truncate long texts. Default: False (raises RuntimeError) |
Outputs
| Name | Type | Description |
|---|---|---|
| tokens | torch.Tensor | 2D tensor of shape [N, context_length] (default [N, 77]) with BPE-encoded token IDs. Each sequence is wrapped with SOT (49406) and EOT (49407) special tokens and zero-padded. |
Usage Examples
Tokenizing Class Labels
import clip
# Tokenize a single string
tokens = clip.tokenize("a photo of a cat")
# tokens.shape: [1, 77]
# Tokenize multiple class descriptions
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
tokens = clip.tokenize(labels)
# tokens.shape: [3, 77]
Handling Long Text
import clip
# Long text will be truncated if truncate=True
long_text = "A very detailed description of an image " * 20
tokens = clip.tokenize(long_text, truncate=True)
# tokens.shape: [1, 77] — truncated to fit, last token is EOT
# Without truncate=True, raises RuntimeError
try:
tokens = clip.tokenize(long_text)
except RuntimeError as e:
print(f"Error: {e}")
Zero-shot Classification Setup
import clip
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Tokenize class descriptions and move to device
text_descriptions = [f"a photo of a {c}" for c in ["cat", "dog", "bird"]]
text_tokens = clip.tokenize(text_descriptions).to(device)
# text_tokens.shape: [3, 77]
# Encode with the model
with torch.no_grad():
text_features = model.encode_text(text_tokens)
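To complete the zero-shot pipeline, the text features above would be compared against an image feature by cosine similarity and converted to class probabilities. A minimal pure-Python sketch of that scoring step follows; the 3-dimensional vectors are made-up stand-ins for CLIP's real embeddings, and the 100x logit scale mirrors CLIP's learned temperature.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-dim features standing in for CLIP's image/text embeddings
image_feature = [0.9, 0.1, 0.2]
text_features = [[0.8, 0.2, 0.1],   # "a photo of a cat"
                 [0.1, 0.9, 0.3],   # "a photo of a dog"
                 [0.0, 0.2, 0.9]]   # "a photo of a bird"

sims = [cosine_similarity(image_feature, t) for t in text_features]
probs = softmax([100.0 * s for s in sims])  # scale similarities into logits
best = max(range(len(probs)), key=probs.__getitem__)
```

In the real pipeline the same comparison runs on model.encode_image and model.encode_text outputs, with the features L2-normalized before the dot product.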