Implementation:OpenAI CLIP CLIP.encode_text

From Leeroopedia
Knowledge Sources
Domains NLP, Deep_Learning, Representation_Learning
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool provided by the CLIP model class for encoding tokenized text into embedding vectors.

Description

The CLIP.encode_text() method takes a batch of tokenized text tensors (from clip.tokenize()) and processes them through the text transformer to produce feature vectors. It performs a token embedding lookup, adds positional embeddings, runs the causal transformer, applies the final layer normalization, extracts the feature at the end-of-text (EOT) token position (located with argmax over the input token IDs, since the EOT token has the highest ID in CLIP's vocabulary), and projects it through the learned text_projection matrix.

The output is a tensor of shape [B, embed_dim] containing one feature vector per text input. These vectors are not L2-normalized by this method.
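
A minimal sketch (not part of the source page) showing how argmax over the token IDs recovers the EOT position: the <|endoftext|> token has the highest ID in CLIP's BPE vocabulary and the remaining context slots are zero-padded, so the printed index is the EOT position for each prompt.

import clip
import torch

tokens = clip.tokenize(["a photo of a cat"])  # shape [1, 77], zero-padded after the EOT token
eot_positions = tokens.argmax(dim=-1)         # the EOT token has the largest ID in each row
print(eot_positions)                          # position of <|endoftext|> in each sequence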

Usage

Call this method after tokenizing text with clip.tokenize(). For inference, wrap the call in a torch.no_grad() context. The resulting features can be compared with image features from encode_image() via cosine similarity (see Usage Examples below).

Code Reference

Source Location

  • Repository: OpenAI CLIP
  • File: clip/model.py
  • Lines: L343-356

Signature

def encode_text(self, text: torch.Tensor) -> torch.Tensor:
    """Encode tokenized text through the text transformer.

    Parameters
    ----------
    text : torch.Tensor
        Batch of tokenized text, shape [B, context_length] (typically
        [B, 77]). Output of clip.tokenize().

    Returns
    -------
    torch.Tensor
        Text feature vectors, shape [B, embed_dim]. Not L2-normalized.
    """
    x = self.token_embedding(text).type(self.dtype)  # [B, 77, d_model]
    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x).type(self.dtype)

    # Take the features at the EOT token position (argmax locates it, since the EOT token has the highest ID in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    return x

Import

import clip
model, preprocess = clip.load("ViT-B/32")
# Then call: model.encode_text(token_tensor)
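
A practical note (an observation about clip.load() behavior, not stated on this page): when loaded on a CUDA device the model weights stay in half precision, so encode_text() returns float16 features there. A minimal sketch of casting them for float32 pipelines:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a photo of a cat"]).to(device)

with torch.no_grad():
    features = model.encode_text(tokens)

# On CUDA the model runs in float16; cast explicitly if downstream code expects float32
features = features.float()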

I/O Contract

Inputs

  • text (torch.Tensor, required): Batch of tokenized text, shape [B, 77], on the same device as the model. Output of clip.tokenize().

Outputs

  • text_features (torch.Tensor): Text embedding vectors of shape [B, embed_dim], where embed_dim depends on the model variant (e.g. 512 for ViT-B/32). Not L2-normalized.

Usage Examples

Basic Text Encoding

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Tokenize and encode text descriptions
texts = ["a photo of a cat", "a photo of a dog"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
# text_features.shape: [2, 512]

Encoding with Normalization for Similarity

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

texts = ["a cat", "a dog", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    # L2 normalize for cosine similarity
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# text_features.shape: [3, 512], unit norm
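
Comparing Against Image Features

A sketch of the cosine-similarity comparison mentioned in the Usage section, following the pattern of the OpenAI CLIP README; "example.jpg" is a placeholder path, not a file referenced by this page:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path
texts = ["a cat", "a dog", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Normalize both sides so dot products are cosine similarities
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scale and softmax over the text candidates, as in the CLIP README
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# probs.shape: [1, 3]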

Related Pages

Implements Principle

Requires Environment
