# Implementation: Deepset_ai_Haystack_SentenceTransformersDocumentEmbedder

## Metadata

| Field | Value |
|---|---|
| Implementation Name | `SentenceTransformersDocumentEmbedder` |
| Implementing Principle | Deepset_ai_Haystack_Document_Embedding |
| Class | `SentenceTransformersDocumentEmbedder` |
| Module | `haystack.components.embedders.sentence_transformers_document_embedder` |
| Source Reference | `haystack/components/embedders/sentence_transformers_document_embedder.py:L18-270` |
| Repository | Deepset_ai_Haystack |
| Dependencies | `sentence-transformers`, `torch` |
## Overview

`SentenceTransformersDocumentEmbedder` is a Haystack component that computes dense vector embeddings for a list of `Document` objects using Sentence Transformers models, storing each result in the document's `embedding` field. It is designed for indexing pipelines, where documents are embedded before being written to a document store for later semantic retrieval.
## Description

The component wraps the Sentence Transformers library to provide document embedding within the Haystack pipeline architecture. It supports a wide range of configuration options, including model selection, device placement, batch processing, embedding normalization, precision control, and multiple inference backends (PyTorch, ONNX, OpenVINO).

Key behaviors:

- **Lazy initialization:** the underlying model is not loaded until `warm_up()` is called (or automatically on the first `run()`).
- **Meta field embedding:** metadata fields listed in `meta_fields_to_embed` are concatenated with the document content using the `embedding_separator` before embedding.
- **Prefix/suffix support:** a configurable prefix and suffix are prepended and appended to each document text, supporting instruction-based embedding models such as E5 and BGE.
- **Immutable documents:** the component returns new `Document` instances with the `embedding` field populated, using `dataclasses.replace()` to avoid mutating the input.
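The text-preparation behavior described above (meta-field concatenation plus prefix/suffix wrapping) can be sketched in plain Python. The `Doc` class and `prepare_texts` helper here are hypothetical stand-ins, not Haystack code; they only approximate how the embedder assembles the string handed to the model:

```python
from dataclasses import dataclass, field

# Hypothetical minimal stand-in for Haystack's Document dataclass.
@dataclass
class Doc:
    content: str
    meta: dict = field(default_factory=dict)

def prepare_texts(docs, meta_fields_to_embed, embedding_separator="\n", prefix="", suffix=""):
    """Approximate the text fed to the model: selected meta values, then the
    content, joined by the separator and wrapped in prefix/suffix."""
    texts = []
    for doc in docs:
        meta_values = [str(doc.meta[k]) for k in meta_fields_to_embed if doc.meta.get(k) is not None]
        text = embedding_separator.join(meta_values + [doc.content or ""])
        texts.append(prefix + text + suffix)
    return texts

docs = [Doc(content="Climate change is a global challenge.", meta={"title": "Climate Report"})]
print(prepare_texts(docs, ["title"], prefix="passage: "))
# ['passage: Climate Report\nClimate change is a global challenge.']
```

A prefix such as `"passage: "` matters for instruction-tuned models like E5, which expect different prefixes on the document side and the query side.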
## Code Reference

### Import

```python
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
```
### Constructor Signature

```python
SentenceTransformersDocumentEmbedder(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
)
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | `str` | `"sentence-transformers/all-mpnet-base-v2"` | Hugging Face model ID or local path for the embedding model. |
| `device` | `ComponentDevice \| None` | `None` | Device to load the model on. If `None`, uses the default device. |
| `token` | `Secret \| None` | env var | API token for private Hugging Face models. |
| `prefix` | `str` | `""` | String prepended to each document text before embedding. |
| `suffix` | `str` | `""` | String appended to each document text before embedding. |
| `batch_size` | `int` | `32` | Number of documents to embed in each batch. |
| `progress_bar` | `bool` | `True` | Whether to display a progress bar during embedding. |
| `normalize_embeddings` | `bool` | `False` | If `True`, L2-normalizes embeddings to unit length. |
| `meta_fields_to_embed` | `list[str] \| None` | `None` | Metadata fields to concatenate with document content before embedding. |
| `embedding_separator` | `str` | `"\n"` | Separator between metadata fields and document content. |
| `trust_remote_code` | `bool` | `False` | Whether to allow custom model code from Hugging Face. |
| `local_files_only` | `bool` | `False` | If `True`, only use locally cached models. |
| `truncate_dim` | `int \| None` | `None` | Truncate embeddings to this dimensionality (Matryoshka support). |
| `model_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs for model loading. |
| `tokenizer_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs for tokenizer loading. |
| `config_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs for config loading. |
| `precision` | `Literal["float32", "int8", "uint8", "binary", "ubinary"]` | `"float32"` | Embedding precision. |
| `encode_kwargs` | `dict[str, Any] \| None` | `None` | Additional kwargs passed to `SentenceTransformer.encode`. |
| `backend` | `Literal["torch", "onnx", "openvino"]` | `"torch"` | Inference backend. |
| `revision` | `str \| None` | `None` | Specific model version (branch, tag, or commit ID). |
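The effect of `normalize_embeddings=True` can be illustrated without loading a model. This is a plain-Python sketch of L2 normalization, not Haystack code:

```python
import math

def l2_normalize(vec):
    # Divide each component by the vector's Euclidean norm so the result has
    # unit length; cosine similarity then reduces to a plain dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

v = l2_normalize([3.0, 4.0])
print(v)                      # [0.6, 0.8]
print(sum(x * x for x in v))  # 1.0 (unit length)
```

Normalizing at indexing time is useful when the retriever scores by dot product, since dot product on unit vectors equals cosine similarity.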
## I/O Contract

### Input

| Parameter | Type | Required | Description |
|---|---|---|---|
| `documents` | `list[Document]` | Yes | List of Haystack `Document` objects to embed. |

### Output

| Key | Type | Description |
|---|---|---|
| `documents` | `list[Document]` | The input documents with the `embedding` field populated with dense vectors. |
The output dictionary has the structure:

```python
{"documents": list[Document]}  # each Document now has .embedding set
```
## Usage Examples

### Basic Document Embedding

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc = Document(content="I love pizza!")

doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)
# [-0.07804739475250244, 0.1498992145061493, ...]
```
### Embedding with Metadata Fields

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

docs = [
    Document(content="Climate change is a global challenge.", meta={"title": "Climate Report"}),
    Document(content="Python is a versatile language.", meta={"title": "Programming Guide"}),
]

embedder = SentenceTransformersDocumentEmbedder(
    meta_fields_to_embed=["title"],
    normalize_embeddings=True,
    batch_size=64,
)
embedder.warm_up()

result = embedder.run(docs)
for doc in result["documents"]:
    print(f"{doc.meta['title']}: embedding dim = {len(doc.embedding)}")
```
### Full Indexing Pipeline

```python
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-mpnet-base-v2",
    batch_size=32,
    normalize_embeddings=True,
))
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("embedder.documents", "writer.documents")

docs = [Document(content="First document"), Document(content="Second document")]
pipeline.run({"embedder": {"documents": docs}})
```
## Related Pages

### Implements Principle

- Principle: Deepset_ai_Haystack_Document_Embedding -- the principle this component implements.
- Related Implementation: Deepset_ai_Haystack_SentenceTransformersTextEmbedder -- the query-side embedder using the same model family.
- Related Implementation: Deepset_ai_Haystack_InMemoryEmbeddingRetriever -- retriever that consumes the embeddings produced by this component.

### Requires Environment

- Environment: Deepset_ai_Haystack_HuggingFace_Model_Environment
- Environment: Deepset_ai_Haystack_GPU_Device_Environment