# Implementation: Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder

## Metadata

| Field | Value |
|---|---|
| Implementation Name | `SentenceTransformersDocumentEmbedder` |
| Implementing Principle | Deepset_ai_Haystack_Document_Embedding |
| Class | `SentenceTransformersDocumentEmbedder` |
| Module | `haystack.components.embedders.sentence_transformers_document_embedder` |
| Source Reference | `haystack/components/embedders/sentence_transformers_document_embedder.py:L18-270` |
| Repository | Deepset_ai_Haystack |
| Dependencies | `sentence-transformers`, `torch` |
## Overview

`SentenceTransformersDocumentEmbedder` is a Haystack component that computes dense vector embeddings for a list of `Document` objects using Sentence Transformers models, storing each result in the document's `embedding` field. It is designed for indexing pipelines, where documents are embedded before being written to a document store for later semantic retrieval.
## Description

The component wraps the Sentence Transformers library to provide document embedding within the Haystack pipeline architecture. It supports a wide range of configuration options, including model selection, device placement, batch processing, embedding normalization, precision control, and multiple inference backends (PyTorch, ONNX, OpenVINO).

Key behaviors:

- **Lazy initialization:** the underlying model is not loaded until `warm_up()` is called (or automatically on the first `run()`).
- **Meta field embedding:** metadata fields listed in `meta_fields_to_embed` are concatenated with the document content using the `embedding_separator` before embedding.
- **Prefix/suffix support:** a configurable prefix and suffix are prepended and appended to each document text, supporting instruction-based embedding models such as E5 and BGE.
- **Immutable documents:** the component returns new `Document` instances with the `embedding` field populated, using `dataclasses.replace()` to avoid mutating the input.
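The text-preparation behavior described above (meta-field concatenation plus prefix/suffix wrapping) can be sketched in plain Python. The `Doc` class and `prepare_texts` helper here are hypothetical stand-ins, not Haystack code; they only approximate how the embedder assembles the string handed to the model:

```python
from dataclasses import dataclass, field

# Hypothetical minimal stand-in for Haystack's Document dataclass.
@dataclass
class Doc:
    content: str
    meta: dict = field(default_factory=dict)

def prepare_texts(docs, meta_fields_to_embed, embedding_separator="\n", prefix="", suffix=""):
    """Approximate the text fed to the model: selected meta values, then the
    content, joined by the separator and wrapped in prefix/suffix."""
    texts = []
    for doc in docs:
        meta_values = [str(doc.meta[k]) for k in meta_fields_to_embed if doc.meta.get(k) is not None]
        text = embedding_separator.join(meta_values + [doc.content or ""])
        texts.append(prefix + text + suffix)
    return texts

docs = [Doc(content="Climate change is a global challenge.", meta={"title": "Climate Report"})]
print(prepare_texts(docs, ["title"], prefix="passage: "))
# ['passage: Climate Report\nClimate change is a global challenge.']
```

A prefix such as `"passage: "` matters for instruction-tuned models like E5, which expect different prefixes on the document side and the query side.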
## Code Reference

### Import

```python
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
```
### Constructor Signature

```python
SentenceTransformersDocumentEmbedder(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `"sentence-transformers/all-mpnet-base-v2"` | Hugging Face model ID or local path for the embedding model. |
| `device` | `ComponentDevice \| None` | `None` | Device to load the model on. If `None`, uses the default device. |
| `token` | `Secret \| None` | env var | API token for private Hugging Face models. |
| `prefix` | `str` | `""` | String prepended to each document text before embedding. |
| `suffix` | `str` | `""` | String appended to each document text before embedding. |
| `batch_size` | `int` | `32` | Number of documents to embed in each batch. |
| `progress_bar` | `bool` | `True` | Whether to display a progress bar during embedding. |
| `normalize_embeddings` | `bool` | `False` | If `True`, L2-normalizes embeddings to unit length. |
| `meta_fields_to_embed` | `list[str] \| None` | `None` | Metadata fields to concatenate with document content before embedding. |
| `embedding_separator` | `str` | `"\n"` | Separator between metadata fields and document content. |
| `trust_remote_code` | `bool` | `False` | Whether to allow custom model code from Hugging Face. |
| `local_files_only` | `bool` | `False` | If `True`, only use locally cached models. |
| `truncate_dim` | `int \| None` | `None` | Truncate embeddings to this dimensionality (Matryoshka support). |
| `model_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs for model loading. |
| `tokenizer_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs for tokenizer loading. |
| `config_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs for config loading. |
| `precision` | `Literal["float32", "int8", "uint8", "binary", "ubinary"]` | `"float32"` | Embedding precision. |
| `encode_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs passed to `SentenceTransformer.encode`. |
| `backend` | `Literal["torch", "onnx", "openvino"]` | `"torch"` | Inference backend. |
| `revision` | `str \| None` | `None` | Specific model version (branch, tag, or commit ID). |
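The effect of `normalize_embeddings=True` can be illustrated without loading a model. This is a plain-Python sketch of L2 normalization, not Haystack code:

```python
import math

def l2_normalize(vec):
    # Divide each component by the vector's Euclidean norm so the result has
    # unit length; cosine similarity then reduces to a plain dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

v = l2_normalize([3.0, 4.0])
print(v)                      # [0.6, 0.8]
print(sum(x * x for x in v))  # 1.0 (unit length)
```

Normalizing at indexing time is useful when the retriever scores by dot product, since dot product on unit vectors equals cosine similarity.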
## I/O Contract

### Input

| Parameter | Type | Required | Description |
|---|---|---|---|
| `documents` | `list[Document]` | Yes | List of Haystack `Document` objects to embed. |

### Output

| Key | Type | Description |
|---|---|---|
| `documents` | `list[Document]` | The input documents with the `embedding` field populated with dense vectors. |
The output dictionary has the structure:

```python
{"documents": list[Document]}  # each Document now has .embedding set
```
## Usage Examples

### Basic Document Embedding

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc = Document(content="I love pizza!")

doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)
# [-0.07804739475250244, 0.1498992145061493, ...]
```
### Embedding with Metadata Fields

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [
    Document(content="Climate change is a global challenge.", meta={"title": "Climate Report"}),
    Document(content="Python is a versatile language.", meta={"title": "Programming Guide"}),
]

embedder = SentenceTransformersDocumentEmbedder(
    meta_fields_to_embed=["title"],
    normalize_embeddings=True,
    batch_size=64,
)
embedder.warm_up()

result = embedder.run(docs)
for doc in result["documents"]:
    print(f"{doc.meta['title']}: embedding dim = {len(doc.embedding)}")
```
### Full Indexing Pipeline

```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-mpnet-base-v2",
    batch_size=32,
    normalize_embeddings=True,
))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("embedder.documents", "writer.documents")

docs = [Document(content="First document"), Document(content="Second document")]
pipeline.run({"embedder": {"documents": docs}})
```
## Related Pages

### Implements Principle

- Principle: Deepset_ai_Haystack_Document_Embedding -- the principle this component implements.
- Related Implementation: Deepset_ai_Haystack_SentenceTransformersTextEmbedder -- the query-side embedder using the same model family.
- Related Implementation: Deepset_ai_Haystack_InMemoryEmbeddingRetriever -- retriever that consumes the embeddings produced by this component.

### Requires Environment

- Environment: Deepset_ai_Haystack_HuggingFace_Model_Environment
- Environment: Deepset_ai_Haystack_GPU_Device_Environment