Implementation: OpenAI CLIP CLIP.encode_text()
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep_Learning, Representation_Learning |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
Concrete tool for encoding tokenized text into embedding vectors provided by the CLIP model class.
Description
The CLIP.encode_text() method takes a batch of tokenized text tensors (from clip.tokenize()) and runs them through the text transformer to produce feature vectors. It performs a token-embedding lookup, adds positional embeddings, runs the causal transformer, applies the final layer normalization, extracts the features at the end-of-text (EOT) token position (located via argmax over the raw token ids, since the EOT id is the largest in the vocabulary), and projects through the learned text_projection matrix.
The output is a tensor of shape [B, embed_dim] containing one feature vector per text input. These vectors are not L2-normalized by this method.
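The EOT extraction relies on a property of CLIP's tokenizer: the EOT id (49407) is the largest id in the vocabulary, so argmax over a row of token ids returns the EOT position. A minimal pure-Python sketch of that lookup (the special-token ids are CLIP's real ones; the word ids below are made up for illustration):

```python
# CLIP's special-token ids (BPE vocab size 49408; EOT is the largest id)
SOT, EOT, PAD = 49406, 49407, 0

def eot_position(token_ids):
    """Index of the EOT token, found argmax-style over raw token ids."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])

# A padded 77-token row: [SOT, <word ids...>, EOT, 0, 0, ...]
# (the word ids here are illustrative, not real BPE ids)
row = [SOT, 320, 1125, 539, 2368, EOT] + [PAD] * 71
print(eot_position(row))  # -> 5
```

This is why encode_text() can index with `text.argmax(dim=-1)` instead of carrying an explicit EOT mask.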
Usage
Call this method after tokenizing text with clip.tokenize(). Use within a torch.no_grad() context for inference. The resulting features can be compared with image features via cosine similarity.
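For zero-shot scoring, the cosine similarities are typically multiplied by the model's learned logit_scale (roughly 100 in released CLIP checkpoints) and softmaxed over the candidate texts. A pure-Python sketch of that last step, with toy similarity values standing in for real ones:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy cosine similarities between one image and three candidate captions
sims = [0.31, 0.22, 0.05]
logit_scale = 100.0  # approximate value in released CLIP checkpoints
probs = softmax([logit_scale * s for s in sims])
```

With real features this is the `(logit_scale * image_features @ text_features.T).softmax(dim=-1)` pattern from the CLIP README.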
Code Reference
Source Location
- Repository: OpenAI CLIP
- File: clip/model.py
- Lines: L343-356
Signature
def encode_text(self, text: torch.Tensor) -> torch.Tensor:
    """Encode tokenized text through the text transformer.

    Parameters
    ----------
    text : torch.Tensor
        Batch of tokenized text, shape [B, context_length] (typically
        [B, 77]). Output of clip.tokenize().

    Returns
    -------
    torch.Tensor
        Text feature vectors, shape [B, embed_dim]. Not L2-normalized.
    """
    x = self.token_embedding(text).type(self.dtype)  # [B, 77, d_model]
    x = x + self.positional_embedding.type(self.dtype)
    x = x.permute(1, 0, 2)  # NLD -> LND
    x = self.transformer(x)
    x = x.permute(1, 0, 2)  # LND -> NLD
    x = self.ln_final(x).type(self.dtype)

    # Extract features at the EOT token position (highest token id per row)
    x = x[torch.arange(x.shape[0]), text.argmax(dim=-1)] @ self.text_projection
    return x
Import
import clip
model, preprocess = clip.load("ViT-B/32")
# Then call: model.encode_text(token_tensor)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| text | torch.Tensor | Yes | Batch of tokenized text, shape [B, 77], on the same device as the model. Output of clip.tokenize(). |
Outputs
| Name | Type | Description |
|---|---|---|
| text_features | torch.Tensor | Text embedding vectors of shape [B, embed_dim]. embed_dim depends on model variant (e.g. 512 for ViT-B/32). Not L2-normalized. |
Usage Examples
Basic Text Encoding
import clip
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# Tokenize and encode text descriptions
texts = ["a photo of a cat", "a photo of a dog"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)

# text_features.shape: [2, 512]
Encoding with Normalization for Similarity
import clip
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
texts = ["a cat", "a dog", "a bird"]
text_tokens = clip.tokenize(texts).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)

# L2-normalize for cosine similarity
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
# text_features.shape: [3, 512], unit norm
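Once both text and image features are unit-normalized, cosine similarity reduces to a plain dot product. A pure-Python sketch of that arithmetic, using toy 3-dimensional vectors in place of embed_dim-sized features (real code would use the torch operations above):

```python
import math

def l2_normalize(v):
    """Scale v to unit L2 norm, mirroring the division by .norm() above."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    """Dot product of unit vectors equals cosine similarity."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))

# Toy features (embed_dim=3 for readability; ViT-B/32 uses 512)
text_feat = [0.2, -0.4, 0.9]
image_feat = [0.1, -0.5, 1.0]
print(cosine(text_feat, image_feat))  # close to 1.0: similar directions
```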