Principle: Deepset.ai Haystack Document Embedding
Metadata
| Field | Value |
|---|---|
| Principle Name | Document Embedding |
| Domains | NLP, Embeddings |
| Related Implementation | Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder |
| Source Reference | haystack/components/embedders/sentence_transformers_document_embedder.py:L18-270 |
| Repository | Deepset_ai_Haystack |
Overview
Document embedding converts textual documents into dense vector representations for semantic search. It uses sentence transformer models to encode document content into fixed-dimensional embeddings that capture semantic meaning, enabling downstream retrieval components to find relevant documents based on meaning rather than exact keyword matches.
Description
Dense retrieval systems depend on the ability to represent documents as numerical vectors in a high-dimensional space. Document embedding is the process of transforming raw text into these dense vector representations. Each document is passed through a pre-trained neural network encoder that maps the input text to a fixed-size vector. Once embedded, documents can be compared to query embeddings using vector similarity measures such as dot product or cosine similarity.
In the Haystack framework, document embedding is a dedicated pipeline component that operates during the indexing phase. Documents are embedded before being written to a document store, so that at query time the store already contains precomputed vectors ready for fast similarity search.
The embedding process supports several important options:
- Meta field concatenation: Metadata fields (such as title, author, or category) can be prepended to the document content before embedding, enriching the vector representation with structured information.
- Prefix and suffix injection: Some embedding models (such as E5 and BGE) require an instruction prefix or suffix to be added to the input text. The component supports this natively.
- Batch processing: Documents are embedded in configurable batches to balance throughput and memory usage.
- Normalization: L2 normalization can be applied so that all embeddings have unit norm, which makes dot product equivalent to cosine similarity.
- Precision control: Embeddings can be quantized to lower precision formats (int8, uint8, binary, ubinary) to reduce storage and accelerate computation at the cost of some accuracy.
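The text-preparation options above (meta field concatenation and prefix/suffix injection) can be sketched in plain Python. This helper is illustrative only, not the Haystack implementation; the function name and arguments are assumptions:

```python
# Illustrative sketch of pre-embedding text preparation: documents are plain
# dicts with "content" and "meta" keys. Names here are assumptions, not the
# Haystack API.

def prepare_texts(documents, meta_fields_to_embed=(), prefix="", suffix=""):
    """Concatenate selected meta fields with the content, then wrap the
    result with a model-specific prefix/suffix (e.g. "passage: " for E5)."""
    texts = []
    for doc in documents:
        meta_parts = [
            str(doc["meta"][field])
            for field in meta_fields_to_embed
            if doc["meta"].get(field) is not None
        ]
        combined = "\n".join(meta_parts + [doc["content"]])
        texts.append(prefix + combined + suffix)
    return texts

docs = [{"content": "Haystack is an NLP framework", "meta": {"title": "Intro"}}]
print(prepare_texts(docs, meta_fields_to_embed=["title"], prefix="passage: "))
# → ['passage: Intro\nHaystack is an NLP framework']
```

The prepared strings, one per document, are then passed to the encoder in batches.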
Theoretical Basis
Bi-Encoder Architecture
Document embedding relies on the bi-encoder (or dual-encoder) architecture. In this paradigm, documents and queries are independently encoded by separate forward passes through the same model (or two separate models that share a vector space). The key properties are:
- Independence: The document embedding does not depend on the query. This means document embeddings can be precomputed once and stored.
- Shared vector space: Both document and query embeddings exist in the same vector space, so similarity between a query and a document can be computed as a simple vector operation.
- Scalability: Because document embeddings are precomputed, retrieval reduces to a nearest-neighbor search, which can be accelerated with approximate nearest neighbor (ANN) indices.
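A toy illustration of this workflow, with hand-made vectors standing in for encoder output: document vectors are computed and normalized once at indexing time, and each query then reduces to a single matrix-vector product.

```python
import numpy as np

# Hand-made 3-dimensional vectors stand in for real model embeddings.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.1, 0.2, 0.95],  # doc 2
])
# Normalize once at indexing time so dot product equals cosine similarity.
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

query = np.array([0.0, 0.9, 0.1])
query /= np.linalg.norm(query)

scores = doc_embeddings @ query   # one matrix-vector product over the index
best = int(np.argmax(scores))     # index of the most similar document
print(best)  # → 1
```

In production the brute-force argmax is replaced by an ANN index, but the document-side computation is unchanged.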
Sentence Transformers
Sentence Transformers are transformer-based models fine-tuned for producing semantically meaningful sentence and paragraph embeddings. They are typically trained using contrastive learning objectives (such as multiple negatives ranking loss) on large-scale text pair datasets. The default model in Haystack, sentence-transformers/all-mpnet-base-v2, produces 768-dimensional embeddings and was trained on over 1 billion text pairs.
Similarity Computation
Once documents and queries are embedded in the same space, relevance is computed via:
- Cosine similarity: cos(q, d) = (q · d) / (||q|| * ||d||)
- Dot product: score = q · d (equivalent to cosine when embeddings are L2-normalized)
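A quick numerical check of the equivalence between the two measures, using arbitrary example vectors:

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0])
d = np.array([2.0, 1.0, 0.5])

# Cosine similarity on the raw vectors.
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# Dot product after L2 normalization gives the same score.
qn = q / np.linalg.norm(q)
dn = d / np.linalg.norm(d)
dot_normalized = qn @ dn

assert np.isclose(cosine, dot_normalized)
```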
Matryoshka Representation Learning
Some models support truncating embeddings to a lower dimension without significant accuracy loss, a technique known as Matryoshka Representation Learning. The component exposes a truncate_dim parameter to leverage this capability.
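What truncation amounts to can be sketched in plain NumPy: keep the first k dimensions and re-normalize so similarity scores remain comparable. The helper name is illustrative, and the technique only preserves accuracy for models trained with a Matryoshka objective:

```python
import numpy as np

def truncate_embedding(vec, truncate_dim):
    """Keep the leading truncate_dim dimensions and re-normalize to unit norm."""
    truncated = vec[:truncate_dim]
    return truncated / np.linalg.norm(truncated)

# A random 768-dimensional vector stands in for a full model embedding.
full = np.random.default_rng(0).standard_normal(768)
small = truncate_embedding(full, 256)
print(small.shape)  # → (256,)
```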
Usage
Document embedding is used in indexing pipelines to prepare documents for semantic search. A typical indexing pipeline consists of:
- A document converter or preprocessor that produces Document objects.
- A SentenceTransformersDocumentEmbedder that computes embeddings for each document.
- A DocumentWriter that persists the embedded documents to a document store.
At query time, a corresponding text embedder (using the same model) embeds the query, and a retriever fetches documents whose embeddings are closest to the query embedding.
```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# Build the indexing pipeline: embed, then write to the store.
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder.documents", "writer.documents")

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="Haystack is an open-source NLP framework"),
]

# Embeddings are computed and the documents, with vectors attached,
# are persisted in a single run.
indexing_pipeline.run({"embedder": {"documents": docs}})
```
Related Pages
- Implementation: Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder -- The concrete Haystack component that implements this principle.
- Related Principle: Deepset_ai_Haystack_Query_Text_Embedding -- The query-side counterpart; uses the same model to embed queries.
- Related Principle: Deepset_ai_Haystack_Embedding_Based_Retrieval -- The retrieval stage that consumes the embeddings produced by this component.