Implementation: OpenAI CLIP clip.tokenize
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A concrete tool, provided by the OpenAI CLIP library, for converting text strings into padded integer token tensors.
Description
The clip.tokenize() function takes one or more text strings and returns a 2D integer tensor of shape [N, 77], where N is the number of input strings and 77 is CLIP's fixed context length. Internally it uses a SimpleTokenizer (byte-pair encoding with a ~49K-entry vocabulary loaded from a bundled, compressed merges file) to encode each text, wraps each encoded sequence with the start-of-text (49406) and end-of-text (49407) special tokens, and zero-pads to the context length.
The function raises a RuntimeError if any input's encoding exceeds the context length and truncate is False (the default). When truncate is True, it clips the sequence to the context length and ensures the last token is the end-of-text marker.
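The wrap-and-pad contract described above can be sketched in plain Python. This is a minimal sketch, not the library's implementation: wrap_and_pad and the raw BPE ids are hypothetical, but the special-token values, truncation rule, and zero-padding behavior follow the description.

```python
SOT, EOT = 49406, 49407   # CLIP's start-of-text / end-of-text token ids
CONTEXT_LENGTH = 77       # fixed context length used by all CLIP models

def wrap_and_pad(bpe_ids, context_length=CONTEXT_LENGTH, truncate=False):
    """Wrap raw BPE ids with SOT/EOT and zero-pad to context_length.

    Sketch of the contract only; bpe_ids would come from CLIP's
    SimpleTokenizer in the real library.
    """
    tokens = [SOT] + list(bpe_ids) + [EOT]
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(
                f"Input is too long for context length {context_length}")
        tokens = tokens[:context_length]
        tokens[-1] = EOT  # guarantee the sequence still ends with EOT
    return tokens + [0] * (context_length - len(tokens))
```

Calling wrap_and_pad on a short id list yields a length-77 row starting with 49406, ending the content span with 49407, and padded with zeros, matching one row of the tensor clip.tokenize returns.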
Usage
Use this function to prepare text descriptions for CLIP's text encoder. Common inputs include zero-shot classification labels (e.g., "a photo of a cat"), prompt-engineered templates, or any text to be compared against images in CLIP's embedding space.
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/clip.py
- Lines: L205-245
Signature
def tokenize(
texts: Union[str, List[str]],
context_length: int = 77,
truncate: bool = False
) -> Union[torch.IntTensor, torch.LongTensor]:
"""Returns the tokenized representation of given input string(s).
Parameters
----------
texts : Union[str, List[str]]
An input string or a list of input strings to tokenize.
context_length : int
The context length to use; all CLIP models use 77 as the
context length.
truncate : bool
Whether to truncate the text in case its encoding is longer
than the context length. Default: False (raises RuntimeError).
Returns
-------
torch.Tensor
A 2D tensor of shape [N, context_length] containing token IDs.
Returns LongTensor for torch < 1.8.0, IntTensor otherwise.
"""
Import
import clip
# or
from clip import tokenize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| texts | Union[str, List[str]] | Yes | One or more text strings to tokenize (e.g. "a photo of a dog") |
| context_length | int | No | Token sequence length. Default: 77 (must match the CLIP model) |
| truncate | bool | No | Whether to silently truncate long texts. Default: False (raises RuntimeError) |
Outputs
| Name | Type | Description |
|---|---|---|
| tokens | torch.Tensor | 2D tensor of shape [N, context_length] (default [N, 77]) with BPE-encoded token IDs. Each sequence is wrapped with SOT (49406) and EOT (49407) special tokens and zero-padded. |
Usage Examples
Tokenizing Class Labels
import clip
# Tokenize a single string
tokens = clip.tokenize("a photo of a cat")
# tokens.shape: [1, 77]
# Tokenize multiple class descriptions
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
tokens = clip.tokenize(labels)
# tokens.shape: [3, 77]
Handling Long Text
import clip
# Long text will be truncated if truncate=True
long_text = "A very detailed description of an image " * 20
tokens = clip.tokenize(long_text, truncate=True)
# tokens.shape: [1, 77] — truncated to fit, last token is EOT
# Without truncate=True, raises RuntimeError
try:
tokens = clip.tokenize(long_text)
except RuntimeError as e:
print(f"Error: {e}")
Zero-shot Classification Setup
import clip
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Tokenize class descriptions and move to device
text_descriptions = [f"a photo of a {c}" for c in ["cat", "dog", "bird"]]
text_tokens = clip.tokenize(text_descriptions).to(device)
# text_tokens.shape: [3, 77]
# Encode with the model
with torch.no_grad():
text_features = model.encode_text(text_tokens)
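To complete the zero-shot pipeline, the text features above would be compared against an image feature by cosine similarity and converted to class probabilities. A minimal pure-Python sketch of that scoring step follows; the 3-dimensional vectors are made-up stand-ins for CLIP's real embeddings, and the 100x logit scale mirrors CLIP's learned temperature.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 3-dim features standing in for CLIP's image/text embeddings
image_feature = [0.9, 0.1, 0.2]
text_features = [[0.8, 0.2, 0.1],   # "a photo of a cat"
                 [0.1, 0.9, 0.3],   # "a photo of a dog"
                 [0.0, 0.2, 0.9]]   # "a photo of a bird"

sims = [cosine_similarity(image_feature, t) for t in text_features]
probs = softmax([100.0 * s for s in sims])  # scale similarities into logits
best = max(range(len(probs)), key=probs.__getitem__)
```

In the real pipeline the same comparison runs on model.encode_image and model.encode_text outputs, with the features L2-normalized before the dot product.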