Principle: OpenAI CLIP Text Feature Encoding
| Knowledge Sources | |
|---|---|
| Domains | NLP, Deep Learning, Representation Learning |
| Last Updated | 2026-02-13 22:00 GMT |
Overview
A text encoding mechanism that maps tokenized text sequences into a shared embedding space where they can be compared directly with image embeddings via cosine similarity.
Description
Text Feature Encoding is the process of transforming tokenized text (integer token IDs) into a fixed-dimensional feature vector that captures the semantic meaning of the text. In CLIP, this is performed by a causal Transformer that processes the token sequence and extracts the feature at the position of the end-of-text (EOT) token.
The encoding pipeline consists of:
- Token embedding: Map each integer token ID to a dense vector using a learned embedding table (vocab size ~49K, width = transformer_width).
- Positional embedding: Add learned positional embeddings (77 positions, matching the context length).
- Causal transformer: Process the sequence through transformer layers with a causal (triangular) attention mask that prevents tokens from attending to future positions.
- EOT feature extraction: Extract the feature vector at the position of the end-of-text token, which serves as the sequence-level representation (analogous to [CLS] in BERT, but positioned at the EOT).
- Linear projection: Project the extracted feature through a learned text_projection matrix to map from transformer_width to the shared embed_dim.
The resulting vector has dimension embed_dim and lies in the same space as image features.
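The five steps above can be sketched as a small PyTorch module. This is an illustrative approximation built from standard `torch.nn` components with ViT-B/32-style dimensions (vocab 49408, width 512, context 77, embed_dim 512), not OpenAI's actual implementation; the class name and layer count are assumptions.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, WIDTH, CONTEXT, EMBED_DIM = 49408, 512, 77, 512  # ViT-B/32-style

class TextEncoderSketch(nn.Module):
    def __init__(self, num_layers: int = 2):  # real CLIP uses 12 layers
        super().__init__()
        self.token_embedding = nn.Embedding(VOCAB_SIZE, WIDTH)
        self.positional_embedding = nn.Parameter(torch.zeros(CONTEXT, WIDTH))
        layer = nn.TransformerEncoderLayer(WIDTH, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.text_projection = nn.Parameter(torch.empty(WIDTH, EMBED_DIM))
        nn.init.normal_(self.text_projection, std=WIDTH ** -0.5)
        # Boolean causal mask: True above the diagonal blocks future positions
        self.register_buffer(
            "causal_mask", torch.triu(torch.ones(CONTEXT, CONTEXT), 1).bool()
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, 77) integer token IDs, EOT has the highest ID
        x = self.token_embedding(tokens) + self.positional_embedding
        x = self.transformer(x, mask=self.causal_mask)
        eot_positions = tokens.argmax(dim=-1)            # locate EOT per row
        x = x[torch.arange(x.shape[0]), eot_positions]   # sequence feature
        return x @ self.text_projection                  # (batch, embed_dim)
```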
Usage
Use this principle to encode text descriptions, class labels, or prompt templates into CLIP's embedding space for comparison with image embeddings. Essential for zero-shot classification, text-image retrieval, and prompt engineering workflows.
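A minimal zero-shot classification sketch under the assumption of precomputed, unit-normalized features. Random tensors stand in for real `encode_text` / `encode_image` outputs, and the logit scale of 100 approximates CLIP's learned temperature after training.

```python
import torch
import torch.nn.functional as F

# Stand-ins for CLIP features (assumption: embed_dim = 512)
text_features = F.normalize(torch.randn(3, 512), dim=-1)   # 3 class prompts
image_features = F.normalize(torch.randn(1, 512), dim=-1)  # 1 image

# Cosine similarity is the dot product of unit vectors; scale by the
# learned temperature (logit_scale, roughly 100 in trained CLIP models)
logits = 100.0 * image_features @ text_features.T
probs = logits.softmax(dim=-1)   # zero-shot class probabilities
pred = probs.argmax(dim=-1)      # index of the best-matching prompt
```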
Theoretical Basis
The text encoder uses a causal (autoregressive) attention mask even though it is used for encoding rather than generation. The CLIP authors kept masked self-attention to preserve the option of initializing the text encoder from a pretrained language model or adding language modeling as an auxiliary objective. A side effect is that information flows strictly left-to-right, which is why the sequence summary must be taken at the final (EOT) position rather than the first:
```python
import torch

context_length = 77  # CLIP's fixed text context length

# Causal attention mask construction
mask = torch.empty(context_length, context_length)
mask.fill_(float("-inf"))
mask.triu_(1)  # Upper triangle = -inf; diagonal and below = 0
# Result: each position can only attend to itself and earlier positions
```
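A toy demonstration (not CLIP code) of the mask's effect: adding it to raw attention scores before the softmax drives the weight on every future position to zero.

```python
import torch

L = 4
# Additive causal mask: -inf above the diagonal, 0 on and below it
mask = torch.empty(L, L).fill_(float("-inf")).triu_(1)
scores = torch.zeros(L, L)  # uniform raw attention scores, for illustration
weights = (scores + mask).softmax(dim=-1)
# Row i attends uniformly over positions 0..i and not beyond:
# weights[0] = [1, 0, 0, 0]; weights[1] = [0.5, 0.5, 0, 0]; ...
```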
The EOT token extraction is the key mechanism for obtaining a sequence-level representation:
```python
# The EOT token carries the highest ID in each tokenized sequence
# (all BPE token IDs are below the EOT ID, and padding is 0),
# so an argmax over token IDs locates the EOT position.
eot_positions = text.argmax(dim=-1)              # EOT index per sequence
batch_indices = torch.arange(text.shape[0])      # one row index per sequence
text_features = hidden_states[batch_indices, eot_positions]
text_features = text_features @ text_projection  # Project to embed_dim
```
This is analogous to using the [CLS] token in BERT-like models, but placed at the end of the meaningful content rather than the beginning.
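A self-contained toy example (tiny hypothetical vocabulary in which EOT = 9 is the highest ID and padding is 0) showing how the argmax trick locates the EOT feature in a batch:

```python
import torch

EOT = 9  # hypothetical: highest token ID in this toy vocabulary
text = torch.tensor([[5, 3, EOT, 0, 0],
                     [2, 7, 4, EOT, 0]])                # (batch=2, seq=5)
# Fake per-token hidden states with a tiny width of 3 for readability
hidden_states = torch.arange(2 * 5 * 3, dtype=torch.float).reshape(2, 5, 3)

eot_positions = text.argmax(dim=-1)                     # tensor([2, 3])
feats = hidden_states[torch.arange(2), eot_positions]   # (2, 3) EOT rows
```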