Implementation:OpenAI CLIP clip.tokenize

From Leeroopedia
Knowledge Sources
Domains NLP, Preprocessing
Last Updated 2026-02-13 22:00 GMT

Overview

A concrete tool, provided by the OpenAI CLIP library, for converting text strings into padded integer token tensors.

Description

The clip.tokenize() function takes one or more text strings and returns a 2D integer tensor of shape [N, 77], where N is the number of input strings and 77 is CLIP's fixed context length. Internally it uses a SimpleTokenizer (BPE with a ~49K vocabulary loaded from a bundled compressed merge file) to encode each text, wraps each encoded sequence with start-of-text (49406) and end-of-text (49407) special tokens, and zero-pads to the context length.

The function raises a RuntimeError if any input's encoding exceeds the context length of 77 tokens and truncate is False (the default). When truncate is True, it truncates to 77 tokens and forces the last token to be the end-of-text marker.
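The wrap-and-pad behavior described above can be sketched in plain Python. This is an illustrative stub only: the real function first runs CLIP's BPE SimpleTokenizer to produce the token IDs, and the input IDs below are hypothetical placeholders.

```python
# Illustrative sketch of clip.tokenize's wrap, truncate, and pad logic.
# The real function encodes text with CLIP's BPE tokenizer first; here we
# start from pre-encoded IDs to keep the sketch dependency-free.
SOT, EOT = 49406, 49407          # start-of-text / end-of-text special tokens
CONTEXT_LENGTH = 77

def wrap_and_pad(bpe_ids, context_length=CONTEXT_LENGTH, truncate=False):
    """Wrap already-encoded BPE IDs with SOT/EOT and zero-pad to context_length."""
    tokens = [SOT] + list(bpe_ids) + [EOT]
    if len(tokens) > context_length:
        if not truncate:
            raise RuntimeError(
                f"Input is too long for context length {context_length}"
            )
        tokens = tokens[:context_length]
        tokens[-1] = EOT          # ensure the sequence still ends with EOT
    return tokens + [0] * (context_length - len(tokens))

row = wrap_and_pad([320, 1125, 539, 320, 2368])  # hypothetical BPE IDs
# row is length 77: SOT, five IDs, EOT, then zero padding
```

The same structure explains why truncate=True always keeps the end-of-text marker: after slicing to 77 tokens, the last position is overwritten with EOT.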

Usage

Use this function to prepare text descriptions for CLIP's text encoder. Common inputs include zero-shot classification labels (e.g., "a photo of a cat"), prompt-engineered templates, or any text to be compared against images in CLIP's embedding space.

Code Reference

Source Location

  • Repository: OpenAI CLIP
  • File: clip/clip.py
  • Lines: L205-245

Signature

def tokenize(
    texts: Union[str, List[str]],
    context_length: int = 77,
    truncate: bool = False
) -> Union[torch.IntTensor, torch.LongTensor]:
    """Returns the tokenized representation of given input string(s).

    Parameters
    ----------
    texts : Union[str, List[str]]
        An input string or a list of input strings to tokenize.

    context_length : int
        The context length to use; all CLIP models use 77 as the
        context length.

    truncate : bool
        Whether to truncate the text in case its encoding is longer
        than the context length. Default: False (raises RuntimeError).

    Returns
    -------
    torch.Tensor
        A 2D tensor of shape [N, context_length] containing token IDs.
        Returns LongTensor for torch < 1.8.0, IntTensor otherwise.
    """

Import

import clip
# or
from clip import tokenize

I/O Contract

Inputs

Name | Type | Required | Description
texts | Union[str, List[str]] | Yes | One or more text strings to tokenize (e.g. "a photo of a dog")
context_length | int | No | Token sequence length. Default: 77 (must match the CLIP model)
truncate | bool | No | Whether to silently truncate long texts. Default: False (raises RuntimeError)

Outputs

Name | Type | Description
tokens | torch.Tensor | 2D tensor of shape [N, 77] with BPE-encoded token IDs. Sequences are wrapped with SOT (49406) and EOT (49407) special tokens and zero-padded.

Usage Examples

Tokenizing Class Labels

import clip

# Tokenize a single string
tokens = clip.tokenize("a photo of a cat")
# tokens.shape: [1, 77]

# Tokenize multiple class descriptions
labels = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
tokens = clip.tokenize(labels)
# tokens.shape: [3, 77]

Handling Long Text

import clip

# Long text will be truncated if truncate=True
long_text = "A very detailed description of an image " * 20
tokens = clip.tokenize(long_text, truncate=True)
# tokens.shape: [1, 77] — truncated to fit, last token is EOT

# Without truncate=True, raises RuntimeError
try:
    tokens = clip.tokenize(long_text)
except RuntimeError as e:
    print(f"Error: {e}")

Zero-shot Classification Setup

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Tokenize class descriptions and move to device
text_descriptions = [f"a photo of a {c}" for c in ["cat", "dog", "bird"]]
text_tokens = clip.tokenize(text_descriptions).to(device)
# text_tokens.shape: [3, 77]

# Encode with the model
with torch.no_grad():
    text_features = model.encode_text(text_tokens)

Related Pages

Implements Principle

Requires Environment
