Principle: Deepset.ai Haystack Document Embedding
Metadata
| Field | Value |
|---|---|
| Principle Name | Document Embedding |
| Domains | NLP, Embeddings |
| Related Implementation | Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder |
| Source Reference | haystack/components/embedders/sentence_transformers_document_embedder.py:L18-270 |
| Repository | Deepset_ai_Haystack |
Overview
Document embedding converts textual documents into dense vector representations for semantic search. It uses sentence transformer models to encode document content into fixed-dimensional embeddings that capture semantic meaning, enabling downstream retrieval components to find relevant documents based on meaning rather than exact keyword matches.
Description
Dense retrieval systems depend on the ability to represent documents as numerical vectors in a high-dimensional space. Document embedding is the process of transforming raw text into these dense vector representations. Each document is passed through a pre-trained neural network encoder that maps the input text to a fixed-size vector. Once embedded, documents can be compared to query embeddings using vector similarity measures such as dot product or cosine similarity.
In the Haystack framework, document embedding is a dedicated pipeline component that operates during the indexing phase. Documents are embedded before being written to a document store, so that at query time the store already contains precomputed vectors ready for fast similarity search.
The embedding process supports several important options:
- Meta field concatenation: Metadata fields (such as title, author, or category) can be prepended to the document content before embedding, enriching the vector representation with structured information.
- Prefix and suffix injection: Some embedding models (such as E5 and BGE) require an instruction prefix or suffix to be added to the input text. The component supports this natively.
- Batch processing: Documents are embedded in configurable batches to balance throughput and memory usage.
- Normalization: L2 normalization can be applied so that all embeddings have unit norm, which makes dot product equivalent to cosine similarity.
- Precision control: Embeddings can be quantized to lower precision formats (int8, uint8, binary, ubinary) to reduce storage and accelerate computation at the cost of some accuracy.
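The text-preparation options above (meta field concatenation and prefix/suffix injection) can be sketched in plain Python. This helper is illustrative only, not the Haystack implementation; the function name and arguments are assumptions:

```python
# Illustrative sketch of pre-embedding text preparation: documents are plain
# dicts with "content" and "meta" keys. Names here are assumptions, not the
# Haystack API.

def prepare_texts(documents, meta_fields_to_embed=(), prefix="", suffix=""):
    """Concatenate selected meta fields with the content, then wrap the
    result with a model-specific prefix/suffix (e.g. "passage: " for E5)."""
    texts = []
    for doc in documents:
        meta_parts = [
            str(doc["meta"][field])
            for field in meta_fields_to_embed
            if doc["meta"].get(field) is not None
        ]
        combined = "\n".join(meta_parts + [doc["content"]])
        texts.append(prefix + combined + suffix)
    return texts

docs = [{"content": "Haystack is an NLP framework", "meta": {"title": "Intro"}}]
print(prepare_texts(docs, meta_fields_to_embed=["title"], prefix="passage: "))
# → ['passage: Intro\nHaystack is an NLP framework']
```

The prepared strings, one per document, are then passed to the encoder in batches.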
Theoretical Basis
Bi-Encoder Architecture
Document embedding relies on the bi-encoder (or dual-encoder) architecture. In this paradigm, documents and queries are independently encoded by separate forward passes through the same model (or two separate models that share a vector space). The key properties are:
- Independence: The document embedding does not depend on the query. This means document embeddings can be precomputed once and stored.
- Shared vector space: Both document and query embeddings exist in the same vector space, so similarity between a query and a document can be computed as a simple vector operation.
- Scalability: Because document embeddings are precomputed, retrieval reduces to a nearest-neighbor search, which can be accelerated with approximate nearest neighbor (ANN) indices.
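A toy illustration of this workflow, with hand-made vectors standing in for encoder output: document vectors are computed and normalized once at indexing time, and each query then reduces to a single matrix-vector product.

```python
import numpy as np

# Hand-made 3-dimensional vectors stand in for real model embeddings.
doc_embeddings = np.array([
    [0.9, 0.1, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.1, 0.2, 0.95],  # doc 2
])
# Normalize once at indexing time so dot product equals cosine similarity.
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

query = np.array([0.0, 0.9, 0.1])
query /= np.linalg.norm(query)

scores = doc_embeddings @ query   # one matrix-vector product over the index
best = int(np.argmax(scores))     # index of the most similar document
print(best)  # → 1
```

In production the brute-force argmax is replaced by an ANN index, but the document-side computation is unchanged.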
Sentence Transformers
Sentence Transformers are transformer-based models fine-tuned for producing semantically meaningful sentence and paragraph embeddings. They are typically trained using contrastive learning objectives (such as multiple negatives ranking loss) on large-scale text pair datasets. The default model in Haystack, sentence-transformers/all-mpnet-base-v2, produces 768-dimensional embeddings and was trained on over 1 billion text pairs.
Similarity Computation
Once documents and queries are embedded in the same space, relevance is computed via:
- Cosine similarity: cos(q, d) = (q · d) / (||q|| * ||d||)
- Dot product: score = q · d (equivalent to cosine when embeddings are L2-normalized)
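A quick numerical check of the equivalence between the two measures, using arbitrary example vectors:

```python
import numpy as np

q = np.array([1.0, 2.0, 3.0])
d = np.array([2.0, 1.0, 0.5])

# Cosine similarity on the raw vectors.
cosine = q @ d / (np.linalg.norm(q) * np.linalg.norm(d))

# Dot product after L2 normalization gives the same score.
qn = q / np.linalg.norm(q)
dn = d / np.linalg.norm(d)
dot_normalized = qn @ dn

assert np.isclose(cosine, dot_normalized)
```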
Matryoshka Representation Learning
Some models support truncating embeddings to a lower dimension without significant accuracy loss, a technique known as Matryoshka Representation Learning. The component exposes a truncate_dim parameter to leverage this capability.
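What truncation amounts to can be sketched in plain NumPy: keep the first k dimensions and re-normalize so similarity scores remain comparable. The helper name is illustrative, and the technique only preserves accuracy for models trained with a Matryoshka objective:

```python
import numpy as np

def truncate_embedding(vec, truncate_dim):
    """Keep the leading truncate_dim dimensions and re-normalize to unit norm."""
    truncated = vec[:truncate_dim]
    return truncated / np.linalg.norm(truncated)

# A random 768-dimensional vector stands in for a full model embedding.
full = np.random.default_rng(0).standard_normal(768)
small = truncate_embedding(full, 256)
print(small.shape)  # → (256,)
```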
Usage
Document embedding is used in indexing pipelines to prepare documents for semantic search. A typical indexing pipeline consists of:
- A document converter or preprocessor that produces Document objects.
- A SentenceTransformersDocumentEmbedder that computes embeddings for each document.
- A DocumentWriter that persists the embedded documents to a document store.
At query time, a corresponding text embedder (using the same model) embeds the query, and a retriever fetches documents whose embeddings are closest to the query embedding.
```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

# Build the indexing pipeline: embed, then write to the store.
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder.documents", "writer.documents")

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="Haystack is an open-source NLP framework"),
]

# Embeddings are computed and the documents, with vectors attached,
# are persisted in a single run.
indexing_pipeline.run({"embedder": {"documents": docs}})
```
Related Pages
- Implementation: Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder -- The concrete Haystack component that implements this principle.
- Related Principle: Deepset_ai_Haystack_Query_Text_Embedding -- The query-side counterpart; uses the same model to embed queries.
- Related Principle: Deepset_ai_Haystack_Embedding_Based_Retrieval -- The retrieval stage that consumes the embeddings produced by this component.