Workflow:Ggml org Llama cpp Embedding Extraction
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Embeddings, Retrieval |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
End-to-end process for extracting dense vector embeddings from text inputs using a GGUF embedding model, supporting multiple pooling strategies and batch processing.
Description
This workflow generates fixed-dimensional dense vector representations (embeddings) from text inputs using GGUF models configured for embedding extraction. The embeddings capture semantic meaning of the input text and can be used for similarity search, clustering, retrieval-augmented generation (RAG), and classification tasks. The workflow supports multiple pooling strategies (mean, CLS token, last token, rank-based), batch processing of multiple inputs for efficiency, and output in various formats including JSON and raw arrays. It can also compute cosine similarity matrices between pairs of inputs.
Usage
Execute this workflow when you need to convert text into numerical vector representations for semantic search, RAG pipelines, document similarity computation, text clustering, or classification. The input model must be an embedding model (e.g., nomic-embed, BGE, GTE, E5) converted to GGUF format, or a causal LM with embedding extraction enabled.
Execution Steps
Step 1: Load Embedding Model
Load a GGUF embedding model with the embedding flag enabled. This configures the model and context to compute and expose the internal activation vectors rather than generating text tokens. The model's pooling type (mean, CLS, last, rank) is read from the GGUF metadata unless it is explicitly overridden.
Key considerations:
- The model must support embedding extraction (embedding models or causal LMs with embedding mode)
- Pooling type is typically set in the model metadata but can be overridden
- Context size should accommodate the longest expected input text
- Embedding dimensionality is fixed per model (e.g., 768, 1024, 4096)
Step 2: Prepare Input Texts
Collect and preprocess the text inputs to be embedded. Multiple texts can be provided and will be processed in batches. Each text is assigned a unique sequence ID for tracking through the batch processing pipeline.
Key considerations:
- Inputs can be split from a single file using a configurable separator
- Each input becomes a separate sequence in the batch
- Very long inputs should be truncated to the model's context length
- Some embedding models require specific prefixes (e.g., "query:" or "passage:")
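The preparation step can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the function name `prepare_inputs`, the default newline separator, and the `"query: "` prefix are all assumptions chosen for the example.

```python
def prepare_inputs(raw: str, separator: str = "\n", prefix: str = "") -> list[tuple[int, str]]:
    """Split raw text on a separator and pair each non-empty piece with a sequence ID.

    Hypothetical helper: the separator and prefix are illustrative defaults,
    not values taken from any particular model or tool.
    """
    pieces = [piece.strip() for piece in raw.split(separator)]
    # Drop empty pieces, then prepend the model-specific prefix (if any).
    texts = [prefix + piece for piece in pieces if piece]
    # enumerate() assigns sequence IDs 0, 1, 2, ... in input order.
    return list(enumerate(texts))


inputs = prepare_inputs("What is GGUF?\nHow are embeddings pooled?", prefix="query: ")
for seq_id, text in inputs:
    print(seq_id, text)
```

Whether a prefix is needed, and which one, depends on how the embedding model was trained, so check the model card before choosing it.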
Step 3: Tokenize and Batch
Tokenize all input texts and organize them into batches for efficient parallel processing. Multiple sequences are packed into a single batch with distinct sequence IDs, allowing the model to process them simultaneously.
Key considerations:
- Batch size is limited by available memory and configured context
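- Sequences of different lengths can share a batch without padding; per-token sequence IDs (and the resulting attention mask) keep each sequence's tokens from attending to the others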
- Token counts determine the actual compute cost per batch
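A rough sketch of packing several sequences into token-limited batches is shown below. The whitespace tokenizer `toy_tokenize` is a hypothetical stand-in (a real GGUF model uses its own vocabulary), and `pack_batches` with its greedy policy is an assumption for illustration rather than the actual batching logic.

```python
def toy_tokenize(text: str) -> list[str]:
    # Hypothetical stand-in tokenizer: splits on whitespace only.
    return text.split()


def pack_batches(texts: list[str], max_tokens: int) -> list[list[tuple[str, int]]]:
    """Greedily pack tokenized inputs into batches of at most max_tokens tokens.

    Each batch is a flat list of (token, seq_id) pairs, where seq_id is the
    index of the input text, so tokens from different inputs stay distinguishable.
    """
    batches: list[list[tuple[str, int]]] = []
    current: list[tuple[str, int]] = []
    used = 0
    for seq_id, text in enumerate(texts):
        tokens = toy_tokenize(text)[:max_tokens]  # truncate over-long inputs
        # Start a new batch if this sequence would not fit in the current one.
        if current and used + len(tokens) > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.extend((tok, seq_id) for tok in tokens)
        used += len(tokens)
    if current:
        batches.append(current)
    return batches


batches = pack_batches(["a b c", "d e", "f g h i"], max_tokens=5)
for i, batch in enumerate(batches):
    print(i, len(batch), batch)
```

The number of tokens per batch, not the number of input texts, is what drives compute cost, which is why the budget here is expressed in tokens.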
Step 4: Compute Embeddings
Run the model's forward pass on each batch to compute the internal representations. The pooling layer aggregates token-level embeddings into a single vector per input sequence. The resulting embedding vectors are extracted from the model's output state.
Key considerations:
- Mean pooling averages all token embeddings
- CLS pooling uses only the first token's embedding
- Last-token pooling uses the final token's embedding
- Rank pooling produces a scalar relevance score per input instead of a vector, for use with reranking models
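The three vector-producing strategies can be illustrated on plain Python lists. This is a sketch of the arithmetic only: the function `pool` is a hypothetical helper, the token embeddings are made-up numbers, and rank pooling is left out because it depends on a model-specific scoring head.

```python
def pool(token_embeddings: list[list[float]], strategy: str = "mean") -> list[float]:
    """Reduce per-token embeddings (one row per token) to a single vector."""
    if not token_embeddings:
        raise ValueError("cannot pool an empty sequence")
    if strategy == "mean":
        n = len(token_embeddings)
        # zip(*rows) iterates over columns, i.e. one embedding dimension at a time.
        return [sum(column) / n for column in zip(*token_embeddings)]
    if strategy == "cls":
        return list(token_embeddings[0])   # first token only
    if strategy == "last":
        return list(token_embeddings[-1])  # final token only
    raise ValueError(f"unknown pooling strategy: {strategy}")


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, 2 dimensions (toy values)
print(pool(tokens, "mean"))  # [3.0, 4.0]
print(pool(tokens, "cls"))   # [1.0, 2.0]
print(pool(tokens, "last"))  # [5.0, 6.0]
```

Which strategy is appropriate depends on how the model was trained, so the value stored in the model metadata is normally the right default.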
Step 5: Normalize and Output
Apply L2 normalization to the raw embedding vectors so that cosine similarity can be computed as a simple dot product. Output the normalized embeddings in the requested format (JSON array, raw float array, or similarity matrix).
Key considerations:
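- With L2-normalized vectors, cosine similarity reduces to a dot product; without normalization, the dot product must be divided by the product of the two vector norms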
- Output format should match the downstream application's expectations
- Cosine similarity between normalized vectors ranges from -1 to 1
- The similarity matrix option is useful for comparing all input pairs
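Normalization and the similarity matrix can be sketched in a few lines of standard-library Python. The helper names `l2_normalize` and `similarity_matrix` and the example vectors are assumptions for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import json
import math


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Cosine similarity between every pair of inputs, via dot products of unit vectors."""
    unit = [l2_normalize(v) for v in vectors]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


embeddings = [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]  # toy 2-dimensional vectors
sim = similarity_matrix(embeddings)

# Emit both outputs as JSON, one of the formats mentioned above.
print(json.dumps({"embeddings": [l2_normalize(v) for v in embeddings]}))
print(json.dumps({"similarity": sim}))
```

In this toy data the first and third vectors point in opposite directions, so their similarity sits at the lower end of the range, while each vector compared with itself sits at the upper end.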