Implementation:OpenAI CLIP CLIP.encode_text

From Leeroopedia
Knowledge Sources
Domains NLP, Deep_Learning, Representation_Learning
Last Updated 2026-02-13 22:00 GMT

Overview

Concrete tool provided by the CLIP model class for encoding tokenized text into embedding vectors.

Description

The CLIP.encode_text() method takes a batch of tokenized text tensors (from clip.tokenize()) and processes them through the text transformer to produce feature vectors. It performs a token embedding lookup, adds positional embeddings, runs the causal transformer, applies the final layer normalization, extracts the feature at the end-of-text (EOT) token position (located with argmax over the input token IDs, since the EOT token has the highest ID in CLIP's vocabulary), and projects it through the learned text_projection matrix.

The output is a tensor of shape [B, embed_dim] containing one feature vector per text input. These vectors are not L2-normalized by this method.
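
A minimal sketch (not part of the source page) showing how argmax over the token IDs recovers the EOT position: the <|endoftext|> token has the highest ID in CLIP's BPE vocabulary and the remaining context slots are zero-padded, so the printed index is the EOT position for each prompt.

import clip
import torch

tokens = clip.tokenize(["a photo of a cat"])  # shape [1, 77], zero-padded after the EOT token
eot_positions = tokens.argmax(dim=-1)         # the EOT token has the largest ID in each row
print(eot_positions)                          # position of <|endoftext|> in each sequence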

Usage

Call this method after tokenizing text with clip.tokenize(). For inference, wrap the call in a torch.no_grad() context. The resulting features can be compared with image features from encode_image() via cosine similarity (see Usage Examples below).

Code Reference

Source Location

  • Repository: OpenAI CLIP
  • File: clip/model.py
  • Lines: L343-356

Signature

def encode_text(self, text: torch.Tensor) -> torch.Tensor:
    """Encode tokenized text through the text transformer.

    Parameters
    ----------
    text : torch.Tensor
        Batch of tokenized text, shape [B, context_length] (typically
        [B, 77]). Output of clip.tokenize().

    Returns
    -------
    torch.Tensor
        Text feature vectors, shape [B, embed_dim]. Not L2-normalized.
    """
    x = self.token_embedding(text).type(self.dtype)  # [B, 77, d_model]
    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x).type(self.dtype)

    # Take the features at the EOT token position (argmax locates it, since the EOT token has the highest ID in each sequence)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    return x

Import

import clip
model, preprocess = clip.load("ViT-B/32")
# Then call: model.encode_text(token_tensor)
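
A practical note (an observation about clip.load() behavior, not stated on this page): when loaded on a CUDA device the model weights stay in half precision, so encode_text() returns float16 features there. A minimal sketch of casting them for float32 pipelines:

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
tokens = clip.tokenize(["a photo of a cat"]).to(device)

with torch.no_grad():
    features = model.encode_text(tokens)

# On CUDA the model runs in float16; cast explicitly if downstream code expects float32
features = features.float()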

I/O Contract

Inputs

  • text (torch.Tensor, required): Batch of tokenized text, shape [B, 77], on the same device as the model. Output of clip.tokenize().

Outputs

  • text_features (torch.Tensor): Text embedding vectors of shape [B, embed_dim], where embed_dim depends on the model variant (e.g. 512 for ViT-B/32). Not L2-normalized.

Usage Examples

Basic Text Encoding

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Tokenize and encode text descriptions
texts = ["a photo of a cat", "a photo of a dog"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
# text_features.shape: [2, 512]

Encoding with Normalization for Similarity

import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

texts = ["a cat", "a dog", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    text_features = model.encode_text(text_tokens)
    # L2 normalize for cosine similarity
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# text_features.shape: [3, 512], unit norm
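
Comparing Against Image Features

A sketch of the cosine-similarity comparison mentioned in the Usage section, following the pattern of the OpenAI CLIP README; "example.jpg" is a placeholder path, not a file referenced by this page:

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image path
texts = ["a cat", "a dog", "a bird"]
text_tokens = clip.tokenize(texts).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    # Normalize both sides so dot products are cosine similarities
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    # Scale and softmax over the text candidates, as in the CLIP README
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
# probs.shape: [1, 3]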

Related Pages

Implements Principle

Requires Environment
