Workflow:Ggml org Llama cpp Embedding Extraction

Knowledge Sources	llama.cpp Embedding Example
Domains	LLMs, Embeddings, Retrieval
Last Updated	2026-02-14 22:00 GMT

Overview

End-to-end process for extracting dense vector embeddings from text inputs using a GGUF embedding model, supporting multiple pooling strategies and batch processing.

Description

This workflow generates fixed-dimensional dense vector representations (embeddings) from text inputs using GGUF models configured for embedding extraction. The embeddings capture semantic meaning of the input text and can be used for similarity search, clustering, retrieval-augmented generation (RAG), and classification tasks. The workflow supports multiple pooling strategies (mean, CLS token, last token, rank-based), batch processing of multiple inputs for efficiency, and output in various formats including JSON and raw arrays. It can also compute cosine similarity matrices between pairs of inputs.

Usage

Execute this workflow when you need to convert text into numerical vector representations for semantic search, RAG pipelines, document similarity computation, text clustering, or classification. The input model must be an embedding model (e.g., nomic-embed, BGE, GTE, E5) converted to GGUF format, or a causal LM with embedding extraction enabled.

Execution Steps

Step 1: Load Embedding Model

Load a GGUF embedding model with the embedding flag enabled. This configures the model and context to compute and expose the internal activation vectors rather than generating text tokens. The model's pooling type (mean, CLS, last, rank) is read from the GGUF metadata.

Key considerations:

The model must support embedding extraction (embedding models or causal LMs with embedding mode)
Pooling type is typically set in the model metadata but can be overridden
Context size should accommodate the longest expected input text
Embedding dimensionality is fixed per model (e.g., 768, 1024, 4096)
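
As a rough sketch of how this step might be driven from a script, the helper below assembles an argument list for the `llama-embedding` example binary. The helper name `build_embedding_command` and the file name `model.gguf` are hypothetical, and the flags `-m`, `-p`, `-c`, and `--pooling` are assumed to match your llama.cpp build; confirm them with `llama-embedding --help` before relying on them.

```python
from typing import List, Optional

# Pooling names listed in this workflow (assumed to be accepted by --pooling).
POOLING_TYPES = {"mean", "cls", "last", "rank"}


def build_embedding_command(
    model_path: str,
    prompt: str,
    pooling: Optional[str] = None,
    ctx_size: Optional[int] = None,
) -> List[str]:
    """Assemble an argument list for the llama-embedding example (hypothetical helper)."""
    cmd = ["llama-embedding", "-m", model_path, "-p", prompt]
    if pooling is not None:
        if pooling not in POOLING_TYPES:
            raise ValueError(f"unknown pooling type: {pooling!r}")
        # Overrides the pooling type stored in the GGUF metadata.
        cmd += ["--pooling", pooling]
    if ctx_size is not None:
        # The context size should cover the longest expected input.
        cmd += ["-c", str(ctx_size)]
    return cmd


if __name__ == "__main__":
    # Prints the argument list; pass it to subprocess.run to actually execute it.
    print(build_embedding_command("model.gguf", "Hello world", pooling="mean", ctx_size=2048))
```

Leaving `pooling` unset keeps whatever pooling type the model metadata declares, which is usually the safer default.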

Step 2: Prepare Input Texts

Collect and preprocess the text inputs to be embedded. Multiple texts can be provided and will be processed in batches. Each text is assigned a unique sequence ID for tracking through the batch processing pipeline.

Key considerations:

Inputs can be split from a single file using a configurable separator
Each input becomes a separate sequence in the batch
Very long inputs should be truncated to the model's context length
Some embedding models require specific prefixes (e.g., "query:" or "passage:")
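
The preparation step can be sketched in a few lines of Python. The function names `split_inputs` and `assign_sequence_ids` are hypothetical, the newline default for the separator is an assumption, and the `"query: "` prefix is only an illustration; the exact prefix string, if any, depends on the model card.

```python
from typing import List, Tuple


def split_inputs(text: str, separator: str = "\n", prefix: str = "") -> List[str]:
    """Split raw text into individual inputs and prepend an optional prefix."""
    parts = [part.strip() for part in text.split(separator)]
    # Drop empty pieces so blank lines do not become empty sequences.
    return [prefix + part for part in parts if part]


def assign_sequence_ids(inputs: List[str]) -> List[Tuple[int, str]]:
    """Pair each input with a sequence ID (its position in the list)."""
    return list(enumerate(inputs))


if __name__ == "__main__":
    raw = "what is a GGUF file\nhow does mean pooling work"
    for seq_id, text in assign_sequence_ids(split_inputs(raw, prefix="query: ")):
        print(seq_id, text)
```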

Step 3: Tokenize and Batch

Tokenize all input texts and organize them into batches for efficient parallel processing. Multiple sequences are packed into a single batch with distinct sequence IDs, allowing the model to process them simultaneously.

Key considerations:

Batch size is limited by available memory and configured context
Sequences of different lengths are packed back to back without padding; the per-token sequence IDs keep attention from crossing sequence boundaries
Token counts determine the actual compute cost per batch
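
One way to picture the batching logic is a greedy packer that fills each batch up to a token budget. This is a minimal sketch, not the llama.cpp implementation: `pack_batches` and `max_batch_tokens` are hypothetical names, the inputs are assumed to be already tokenized into lists of token IDs, and restarting sequence IDs at 0 in every batch is a design choice of the sketch.

```python
from typing import List, Tuple

Batch = List[Tuple[int, List[int]]]  # (sequence ID, token IDs)


def pack_batches(tokenized: List[List[int]], max_batch_tokens: int) -> List[Batch]:
    """Greedily pack tokenized inputs into batches bounded by a token budget."""
    batches: List[Batch] = []
    current: Batch = []
    current_tokens = 0
    for tokens in tokenized:
        if len(tokens) > max_batch_tokens:
            # Truncate over-long inputs so each one fits in a single batch.
            tokens = tokens[:max_batch_tokens]
        if current and current_tokens + len(tokens) > max_batch_tokens:
            # The next input would exceed the budget: close the current batch.
            batches.append(current)
            current, current_tokens = [], 0
        # Sequence IDs restart at 0 in every batch.
        current.append((len(current), tokens))
        current_tokens += len(tokens)
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    for batch in pack_batches([[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]], max_batch_tokens=5):
        print(batch)
```

Because the budget is counted in tokens rather than inputs, many short texts share one batch while a long text may occupy a batch by itself.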

Step 4: Compute Embeddings

Run the model's decode pass on each batch to compute the internal representations. The pooling layer aggregates token-level embeddings into a single vector per input sequence. The resulting embedding vectors are extracted from the model's output state.

Key considerations:

Mean pooling averages all token embeddings
CLS pooling uses only the first token's embedding
Last-token pooling uses the final token's embedding
Rank pooling is used with reranking models and produces a scalar relevance score per input rather than an embedding vector
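
The three vector-producing strategies above can be illustrated on a toy sequence. This is a plain-Python sketch with a hypothetical `pool` function; inside llama.cpp the pooling happens in the compute graph, and rank pooling is left out here because it depends on a model-specific classification head.

```python
from typing import List

Vector = List[float]


def pool(token_embeddings: List[Vector], strategy: str) -> Vector:
    """Collapse the per-token embeddings of one sequence into a single vector."""
    if not token_embeddings:
        raise ValueError("sequence has no tokens")
    if strategy == "mean":
        n = len(token_embeddings)
        # Average each dimension across all tokens.
        return [sum(column) / n for column in zip(*token_embeddings)]
    if strategy == "cls":
        # Use only the first token's embedding.
        return list(token_embeddings[0])
    if strategy == "last":
        # Use only the final token's embedding.
        return list(token_embeddings[-1])
    raise ValueError(f"unsupported pooling strategy: {strategy!r}")


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    for name in ("mean", "cls", "last"):
        print(name, pool(tokens, name))
```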

Step 5: Normalize and Output

Apply L2 normalization to the raw embedding vectors so that cosine similarity can be computed as a simple dot product. Output the normalized embeddings in the requested format (JSON array, raw float array, or as a similarity matrix).

Key considerations:

L2 normalization lets a plain dot product stand in for cosine similarity; without it, the dot product must be divided by the norms of both vectors
Output format should match the downstream application's expectations
Cosine similarity between normalized vectors ranges from -1 to 1
The similarity matrix option is useful for comparing all input pairs
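
The normalization and similarity-matrix steps reduce to a few lines of arithmetic. The sketch below uses hypothetical helper names (`l2_normalize`, `similarity_matrix`) and plain Python lists; returning a zero vector unchanged is a choice made here to avoid dividing by zero, not documented behaviour of the example program.

```python
import json
import math
from typing import List

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vec)
    return [x / norm for x in vec]


def similarity_matrix(embeddings: List[Vector]) -> List[List[float]]:
    """Cosine similarity of every pair, computed as dot products of unit vectors."""
    unit = [l2_normalize(v) for v in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


if __name__ == "__main__":
    raw = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]
    # Emit both the normalized embeddings and the pairwise matrix as JSON.
    print(json.dumps({
        "embeddings": [l2_normalize(v) for v in raw],
        "cosine_similarity": similarity_matrix(raw),
    }, indent=2))
```

The diagonal of the matrix is 1 for every non-zero input, and vectors pointing in opposite directions score -1.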
