
Implementation:Deepset ai Haystack SentenceTransformersDocumentEmbedder

From Leeroopedia

Metadata

| Field | Value |
|---|---|
| Implementation Name | SentenceTransformersDocumentEmbedder |
| Implementing Principle | Deepset_ai_Haystack_Document_Embedding |
| Class | SentenceTransformersDocumentEmbedder |
| Module | haystack.components.embedders.sentence_transformers_document_embedder |
| Source Reference | haystack/components/embedders/sentence_transformers_document_embedder.py:L18-270 |
| Repository | Deepset_ai_Haystack |
| Dependencies | sentence-transformers, torch |

Overview

SentenceTransformersDocumentEmbedder is a Haystack component that calculates dense vector embeddings for a list of Document objects using Sentence Transformers models. It stores the computed embedding in the embedding field of each document. This component is designed for use in indexing pipelines, where documents are embedded before being written to a document store for later semantic retrieval.

Description

The component wraps the Sentence Transformers library to provide document embedding within the Haystack pipeline architecture. It supports a wide range of configuration options including model selection, device placement, batch processing, embedding normalization, precision control, and multiple inference backends (PyTorch, ONNX, OpenVINO).

Key behaviors:

  • Lazy initialization: The underlying model is not loaded until warm_up() is called (pipelines call warm_up() automatically before running their components).
  • Meta field embedding: Metadata fields specified in meta_fields_to_embed are concatenated with document content using the embedding_separator before embedding.
  • Prefix/suffix support: A configurable prefix and suffix are prepended and appended to each document text, supporting instruction-based embedding models like E5 and BGE.
  • Immutable documents: The component returns new Document instances with the embedding field populated, using dataclasses.replace() to avoid mutating the input.
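
The text-preparation behavior described above can be sketched in plain Python. This is an illustrative simplification, not the library's actual code; prepare_text is a hypothetical helper that mirrors the documented roles of prefix, suffix, meta_fields_to_embed, and embedding_separator:

```python
def prepare_text(content, meta, prefix="", suffix="",
                 meta_fields_to_embed=None, embedding_separator="\n"):
    """Join selected meta fields with the content, then wrap with prefix/suffix."""
    fields = meta_fields_to_embed or []
    # Only meta fields that are present and non-None are embedded.
    parts = [str(meta[f]) for f in fields if meta.get(f) is not None]
    parts.append(content)
    return prefix + embedding_separator.join(parts) + suffix

text = prepare_text(
    "Climate change is a global challenge.",
    {"title": "Climate Report"},
    prefix="passage: ",
    meta_fields_to_embed=["title"],
)
print(text)
# passage: Climate Report
# Climate change is a global challenge.
```

The prefix here ("passage: ") follows the convention of instruction-tuned models such as E5, which expect documents and queries to carry different prefixes.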

Code Reference

Import

from haystack.components.embedders import SentenceTransformersDocumentEmbedder

Constructor Signature

SentenceTransformersDocumentEmbedder(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | str | "sentence-transformers/all-mpnet-base-v2" | Hugging Face model ID or local path of the embedding model. |
| device | ComponentDevice \| None | None | Device to load the model on. If None, uses the default device. |
| token | Secret \| None | HF_API_TOKEN / HF_TOKEN env vars | API token for private Hugging Face models. |
| prefix | str | "" | String prepended to each document text before embedding. |
| suffix | str | "" | String appended to each document text before embedding. |
| batch_size | int | 32 | Number of documents to embed in each batch. |
| progress_bar | bool | True | Whether to display a progress bar during embedding. |
| normalize_embeddings | bool | False | If True, L2-normalizes embeddings to unit length. |
| meta_fields_to_embed | list[str] \| None | None | Metadata fields to concatenate with document content before embedding. |
| embedding_separator | str | "\n" | Separator between metadata fields and document content. |
| trust_remote_code | bool | False | Whether to allow custom model code from Hugging Face. |
| local_files_only | bool | False | If True, only use locally cached models. |
| truncate_dim | int \| None | None | Truncate embeddings to this dimensionality (Matryoshka support). |
| model_kwargs | dict[str, Any] \| None | None | Additional kwargs for model loading. |
| tokenizer_kwargs | dict[str, Any] \| None | None | Additional kwargs for tokenizer loading. |
| config_kwargs | dict[str, Any] \| None | None | Additional kwargs for config loading. |
| precision | Literal["float32", "int8", "uint8", "binary", "ubinary"] | "float32" | Precision of the computed embeddings. |
| encode_kwargs | dict[str, Any] \| None | None | Additional kwargs passed to SentenceTransformer.encode. |
| backend | Literal["torch", "onnx", "openvino"] | "torch" | Inference backend: torch, onnx, or openvino. |
| revision | str \| None | None | Specific model version (branch, tag, or commit ID). |
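
To make normalize_embeddings concrete, here is a minimal, self-contained sketch of L2 normalization in plain Python (independent of the library). On unit-length vectors the dot product equals cosine similarity, which is why normalization is commonly enabled for retrieval:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm, as normalize_embeddings=True does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

v = l2_normalize([3.0, 4.0])
print(v)  # [0.6, 0.8]
print(round(sum(x * x for x in v), 6))  # 1.0 (unit length)
```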

I/O Contract

Input

| Parameter | Type | Required | Description |
|---|---|---|---|
| documents | list[Document] | Yes | List of Haystack Document objects to embed. |

Output

| Key | Type | Description |
|---|---|---|
| documents | list[Document] | The input documents with the embedding field populated with dense vectors. |

The output dictionary has the structure:

{"documents": list[Document]}  # each Document now has .embedding set

Usage Examples

Basic Document Embedding

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc = Document(content="I love pizza!")
doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)
# [-0.07804739475250244, 0.1498992145061493, ...]

Embedding with Metadata Fields

from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [
    Document(content="Climate change is a global challenge.", meta={"title": "Climate Report"}),
    Document(content="Python is a versatile language.", meta={"title": "Programming Guide"}),
]

embedder = SentenceTransformersDocumentEmbedder(
    meta_fields_to_embed=["title"],
    normalize_embeddings=True,
    batch_size=64,
)
embedder.warm_up()

result = embedder.run(docs)
for doc in result["documents"]:
    print(f"{doc.meta['title']}: embedding dim = {len(doc.embedding)}")

Full Indexing Pipeline

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-mpnet-base-v2",
    batch_size=32,
    normalize_embeddings=True,
))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("embedder.documents", "writer.documents")

docs = [Document(content="First document"), Document(content="Second document")]
pipeline.run({"embedder": {"documents": docs}})

Related Pages

  • Implements Principle
  • Requires Environment
  • Uses Heuristic
