
Principle:Deepset ai Haystack Document Embedding

From Leeroopedia

Metadata

Principle Name: Document Embedding
Domains: NLP, Embeddings
Related Implementation: Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder
Source Reference: haystack/components/embedders/sentence_transformers_document_embedder.py:L18-270
Repository: Deepset_ai_Haystack

Overview

Document embedding converts textual documents into dense vector representations for semantic search. It uses sentence transformer models to encode document content into fixed-dimensional embeddings that capture semantic meaning, enabling downstream retrieval components to find relevant documents based on meaning rather than exact keyword matches.

Description

Dense retrieval systems depend on the ability to represent documents as numerical vectors in a high-dimensional space. Document embedding is the process of transforming raw text into these dense vector representations. Each document is passed through a pre-trained neural network encoder that maps the input text to a fixed-size vector. Once embedded, documents can be compared to query embeddings using vector similarity measures such as dot product or cosine similarity.

In the Haystack framework, document embedding is a dedicated pipeline component that operates during the indexing phase. Documents are embedded before being written to a document store, so that at query time the store already contains precomputed vectors ready for fast similarity search.

The embedding process supports several important options:

  • Meta field concatenation: Metadata fields (such as title, author, or category) can be prepended to the document content before embedding, enriching the vector representation with structured information.
  • Prefix and suffix injection: Some embedding models (such as E5 and BGE) require an instruction prefix or suffix to be added to the input text. The component supports this natively.
  • Batch processing: Documents are embedded in configurable batches to balance throughput and memory usage.
  • Normalization: L2 normalization can be applied so that all embeddings have unit norm, which makes dot product equivalent to cosine similarity.
  • Precision control: Embeddings can be quantized to lower precision formats (int8, uint8, binary, ubinary) to reduce storage and accelerate computation at the cost of some accuracy.
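The first three options above can be sketched in plain Python. The helper below is a hypothetical illustration of how input text might be assembled before encoding; the function names, separator, and field handling are assumptions for illustration, not Haystack's exact internals.

```python
# Sketch of how an embedder could assemble input text before encoding.
# The separator and field handling are illustrative assumptions, not
# Haystack's exact implementation.

def prepare_text(content, meta=None, meta_fields=(), prefix="", suffix="",
                 separator="\n"):
    """Prepend selected meta fields to the content, then add any
    model-specific prefix/suffix (e.g. E5's "passage: ")."""
    meta = meta or {}
    parts = [str(meta[f]) for f in meta_fields if meta.get(f) is not None]
    parts.append(content)
    return prefix + separator.join(parts) + suffix

def batched(items, batch_size):
    """Yield successive batches to bound peak memory during encoding."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

text = prepare_text("Haystack is an NLP framework", {"title": "Intro"},
                    meta_fields=("title",), prefix="passage: ")
# text == "passage: Intro\nHaystack is an NLP framework"
```

Each prepared string would then be fed to the encoder in batches, with normalization and precision conversion applied to the resulting vectors.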

Theoretical Basis

Bi-Encoder Architecture

Document embedding relies on the bi-encoder (or dual-encoder) architecture. In this paradigm, documents and queries are independently encoded by separate forward passes through the same model (or two separate models that share a vector space). The key properties are:

  • Independence: The document embedding does not depend on the query. This means document embeddings can be precomputed once and stored.
  • Shared vector space: Both document and query embeddings exist in the same vector space, so similarity between a query and a document can be computed as a simple vector operation.
  • Scalability: Because document embeddings are precomputed, retrieval reduces to a nearest-neighbor search, which can be accelerated with approximate nearest neighbor (ANN) indices.
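The precompute-then-search pattern these properties enable can be shown with a toy sketch. The `toy_encode` function below is a deterministic stand-in for a real encoder (a real system would call the sentence transformer model instead), but the structure — embed documents once at indexing time, embed only the query at search time — is the point.

```python
import hashlib
import math

def toy_encode(text, dim=8):
    """Stand-in for a real encoder: a deterministic pseudo-random,
    L2-normalized vector. Replace with a sentence transformer in practice."""
    h = hashlib.sha256(text.encode()).digest()
    v = [b / 255.0 for b in h[:dim]]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Indexing time: embed every document once and store the vectors.
docs = ["Python is a language", "Haystack is a framework"]
index = [(d, toy_encode(d)) for d in docs]

# Query time: one forward pass for the query, then a vector scan.
def search(query, index, top_k=1):
    q = toy_encode(query)
    scored = [(sum(a * b for a, b in zip(q, v)), d) for d, v in index]
    return [d for _, d in sorted(scored, reverse=True)[:top_k]]
```

The linear scan in `search` is what an ANN index replaces at scale; the document-side work never has to be repeated per query.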

Sentence Transformers

Sentence Transformers are transformer-based models fine-tuned for producing semantically meaningful sentence and paragraph embeddings. They are typically trained using contrastive learning objectives (such as multiple negatives ranking loss) on large-scale text pair datasets. The default model in Haystack, sentence-transformers/all-mpnet-base-v2, produces 768-dimensional embeddings and was trained on over 1 billion text pairs.

Similarity Computation

Once documents and queries are embedded in the same space, relevance is computed via:

  • Cosine similarity: cos(q, d) = (q . d) / (||q|| * ||d||)
  • Dot product: score = q . d (equivalent to cosine when embeddings are L2-normalized)
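The equivalence of the two measures under L2 normalization can be checked directly; this is a plain-Python sketch of the formulas above, not library code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def l2_normalize(a):
    n = norm(a)
    return [x / n for x in a]

q, d = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
qn, dn = l2_normalize(q), l2_normalize(d)

# After L2 normalization, dot product equals cosine similarity:
assert abs(dot(qn, dn) - cosine(q, d)) < 1e-12
```

This is why enabling normalization in the embedder lets a document store use the cheaper dot product while still ranking by cosine similarity.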

Matryoshka Representation Learning

Some models support truncating embeddings to a lower dimension without significant accuracy loss, a technique known as Matryoshka Representation Learning. The component exposes a truncate_dim parameter to leverage this capability.
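A minimal sketch of what truncation involves, assuming the common convention of keeping the leading components and re-normalizing (the helper name is illustrative):

```python
import math

def truncate_embedding(vec, truncate_dim):
    """Keep the first truncate_dim components and re-normalize, as is
    typically done when using Matryoshka-trained models at a reduced
    dimensionality."""
    v = vec[:truncate_dim]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

full = [0.5, 0.5, 0.5, 0.5]          # a toy 4-dim embedding
small = truncate_embedding(full, 2)  # 2-dim, unit norm again
```

Matryoshka-trained models concentrate the most useful information in the leading dimensions, which is what makes this simple prefix truncation work; for models trained without that objective, truncation degrades quality much faster.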

Usage

Document embedding is used in indexing pipelines to prepare documents for semantic search. A typical indexing pipeline consists of:

  1. A document converter or preprocessor that produces Document objects.
  2. A SentenceTransformersDocumentEmbedder that computes embeddings for each document.
  3. A DocumentWriter that persists the embedded documents to a document store.

At query time, a corresponding text embedder (using the same model) embeds the query, and a retriever fetches documents whose embeddings are closest to the query embedding.

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))
indexing_pipeline.connect("embedder.documents", "writer.documents")

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="Haystack is an open-source NLP framework"),
]
indexing_pipeline.run({"embedder": {"documents": docs}})
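The query side described above can be wired up with Haystack's public components. This is a sketch: it reuses the `document_store` populated by the indexing pipeline, and relies on both embedders defaulting to the same model (sentence-transformers/all-mpnet-base-v2) so that query and document vectors share one space. Running it downloads the model.

```python
from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

# Build a query pipeline over the already-populated document store.
query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component(
    "retriever", InMemoryEmbeddingRetriever(document_store=document_store)
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": "What is Haystack?"}})
# result["retriever"]["documents"] holds the documents whose precomputed
# embeddings are closest to the query embedding.
```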

Related Pages

Implemented By

Uses Heuristic
