
Principle:Ggml org Llama cpp Embedding Computation

From Leeroopedia
Principle Name: Embedding Computation
Domain: Dense Vector Representations, Transformer Hidden States
Description: Theory of extracting dense vector representations from transformer hidden states: per-token vs pooled embeddings
Related Workflow: Embedding_Extraction (CORE)

Overview

Description

The Embedding Computation principle defines the core theory of extracting dense vector representations from transformer language models. The computation takes tokenized input, processes it through the transformer layers, and extracts the hidden state vectors that encode semantic meaning. These vectors can be extracted at the per-token level or aggregated (pooled) into a single vector per input sequence.

This principle covers:

  • Forward pass execution: Running the model's transformer layers on the input batch to produce hidden state activations.
  • Per-token embedding extraction: Retrieving individual hidden state vectors for each token position, useful for token-level tasks.
  • Sequence-level pooled embeddings: Aggregating token-level vectors into a single fixed-dimensional vector per input sequence using the configured pooling strategy.
  • KV cache management: Clearing the key-value cache before each embedding batch since embeddings are independent computations that do not benefit from cached context.
  • Output dimensionality: Understanding the relationship between model embedding dimension (n_embd), output dimension (n_embd_out), and classification output dimension (n_cls_out).
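The steps above can be sketched end to end as a minimal flow. This is illustrative Python, not the llama.cpp C API: every function here is a toy stand-in, and the "forward pass" is a deterministic placeholder rather than a real transformer.

```python
# Toy sketch of the embedding flow: forward pass -> per-token hidden
# states -> optional pooling. All functions are illustrative stand-ins.

def forward_pass(token_ids, n_embd=4):
    """Toy 'transformer': one deterministic hidden state per token."""
    return [[(t + j) * 0.1 for j in range(n_embd)] for t in token_ids]

def mean_pool(hidden_states):
    """Aggregate per-token vectors into one sequence-level vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / n for j in range(dim)]

def embed(token_ids, pooled=True):
    # In llama.cpp the KV cache would be cleared before each batch;
    # this toy forward pass is stateless, so there is nothing to clear.
    hidden = forward_pass(token_ids)           # per-token hidden states
    return mean_pool(hidden) if pooled else hidden

per_token = embed([1, 2, 3], pooled=False)     # one vector per token
sequence = embed([1, 2, 3], pooled=True)       # one vector per sequence
print(len(per_token), len(sequence))           # 3 4
```

The two return shapes mirror the per-token vs. pooled distinction: three n_embd-dimensional vectors for per-token extraction, a single n_embd-dimensional vector for the pooled case.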

Usage

Embedding computation is the central operation in any embedding extraction workflow. It transforms text into numerical vectors that can be used for:

  • Semantic search and retrieval-augmented generation (RAG)
  • Clustering and classification of documents
  • Similarity comparison between text pairs
  • Dimensionality reduction and visualization of text collections
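The retrieval and similarity use cases above all reduce to comparing embedding vectors, typically with cosine similarity. A minimal self-contained sketch (the vectors here are made up for illustration, not real model outputs):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: semantically similar texts should yield nearby vectors.
query = [0.9, 0.1, 0.0]
doc_a = [0.8, 0.2, 0.1]   # close to the query
doc_b = [0.0, 0.1, 0.9]   # unrelated

assert cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b)
```

In a RAG pipeline, the same comparison is run between a query embedding and each document embedding, and the highest-scoring documents are retrieved.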

Theoretical Basis

Transformer hidden states as semantic representations: In transformer architectures, each layer produces a hidden state vector for every input token. These vectors progressively encode increasingly abstract representations of meaning through the self-attention and feed-forward layers. The final layer's hidden states are the richest semantic representations and serve as the basis for embedding extraction.

Per-token vs. pooled embeddings represent two fundamentally different levels of granularity:

  • Per-token embeddings (LLAMA_POOLING_TYPE_NONE) preserve position-specific information. Retrieved via llama_get_embeddings_ith(ctx, i), which returns the n_embd-dimensional vector for token position i; a vector is available only for positions whose output flag (logits[i]) was set in the batch. These are useful when the downstream task requires per-position representations (e.g., named entity recognition, token classification).
  • Pooled embeddings aggregate all token representations into a single vector per input sequence. Retrieved via llama_get_embeddings_seq(ctx, seq_id), they return one vector per sequence ID. The aggregation method depends on the pooling type:
    • Mean pooling: Averages all token vectors, giving equal weight to each position.
    • CLS pooling: Uses only the first token's vector, following the BERT convention where the [CLS] token absorbs whole-sequence information during training.
    • Last-token pooling: Uses the final token's vector, common for causal language models adapted for embedding tasks.
    • Rank pooling: Passes the pooled representation through a classification head, returning relevance scores rather than embedding vectors.
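The three vector-producing strategies can be illustrated with plain Python (rank pooling is omitted since it requires a trained classification head). The hidden states below are toy values standing in for real transformer outputs:

```python
def mean_pool(hidden):
    """Average all token vectors, equal weight per position."""
    dim = len(hidden[0])
    return [sum(h[j] for h in hidden) / len(hidden) for j in range(dim)]

def cls_pool(hidden):
    """Use only the first token's vector (the [CLS] position)."""
    return hidden[0]

def last_pool(hidden):
    """Use the final token's vector (causal-LM convention)."""
    return hidden[-1]

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, n_embd = 2
print(mean_pool(hidden))  # [3.0, 4.0]
print(cls_pool(hidden))   # [1.0, 2.0]
print(last_pool(hidden))  # [5.0, 6.0]
```

All three strategies map a (tokens x n_embd) matrix to a single n_embd vector; they differ only in which positions contribute.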

KV cache irrelevance for embedding computation is an important optimization detail. Unlike autoregressive generation where the KV cache stores previously computed attention keys and values for efficiency, embedding computation processes each input independently. The cache must be cleared (llama_memory_clear) before each batch to prevent stale context from affecting the embeddings.
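The cache-clearing requirement can be shown with a toy stateful context. This is purely illustrative: llama_memory_clear is the actual llama.cpp call, while everything below is a hypothetical stand-in.

```python
class ToyContext:
    """Toy stand-in for a model context holding a KV cache."""
    def __init__(self):
        self.kv_cache = []           # keys/values from prior decodes

    def decode(self, tokens):
        # Attention would read self.kv_cache here; for embeddings the
        # cache must be empty so only the current batch contributes.
        self.kv_cache.extend(tokens)
        return [[float(t)] for t in tokens]   # toy hidden states

def embed_batch(ctx, tokens):
    ctx.kv_cache.clear()             # analogous to llama_memory_clear
    return ctx.decode(tokens)

ctx = ToyContext()
embed_batch(ctx, [1, 2, 3])
out = embed_batch(ctx, [4, 5])
assert ctx.kv_cache == [4, 5]        # no stale context from batch one
```

Without the clear, the second batch's attention would see the first batch's keys and values, silently contaminating its embeddings.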

Output dimensionality varies by model and task. The base embedding dimension is determined by the model architecture (n_embd). Some models may have a different output embedding dimension (n_embd_out) when a projection head is present. For reranking models, the classification output may have a different dimension (n_cls_out), typically 1 for binary relevance scoring.
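The reranking case can be sketched with a toy linear classification head that projects an n_embd-dimensional pooled embedding down to n_cls_out scores. The weights and dimensions below are invented for illustration:

```python
def classification_head(pooled, weights, bias):
    """Toy linear head: projects an n_embd vector to n_cls_out scores."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]

pooled = [0.5, -0.2, 0.1, 0.3]       # pooled embedding, dim n_embd = 4
weights = [[1.0, 0.0, 2.0, -1.0]]    # shape n_cls_out x n_embd = 1 x 4
bias = [0.1]

scores = classification_head(pooled, weights, bias)
print(len(scores))                   # n_cls_out == 1: one relevance score
```

For a reranker with n_cls_out = 1, the output is a single relevance score per query-document pair rather than an embedding vector.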
