# Heuristic: deepset-ai Haystack Embedding Batch Size and Prefix
| Knowledge Sources | |
|---|---|
| Domains | Embeddings, Optimization |
| Last Updated | 2026-02-11 20:00 GMT |
## Overview
The default embedding batch size of 32 balances throughput and memory; use prefix/suffix strings to match the instruction patterns of models such as E5 and BGE for optimal retrieval quality.
## Description
Sentence Transformers embedders in Haystack use a default batch size of 32, empirically tuned for typical GPU configurations. The `prefix` and `suffix` parameters prepend or append instruction strings required by certain embedding models. For example, E5 models expect queries prefixed with `"query: "` and documents with `"passage: "`, while BGE models expect queries prefixed with `"Represent this sentence for searching relevant passages: "`. Using the correct prefix/suffix is critical for models trained with instruction-aware objectives; omitting them can significantly degrade retrieval quality.
## Usage
Apply this heuristic when configuring embedding components for production pipelines, switching embedding models (especially between instruction-aware models such as E5 or BGE and non-instruction models such as `all-mpnet-base-v2`), or tuning throughput vs. memory usage on specific hardware.
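The model-specific prefixes above can be captured in a small lookup table. This is a hypothetical helper for illustration, not a Haystack API; the prefix strings themselves come from the E5 and BGE model usage instructions:

```python
# Hypothetical helper mapping model families to their instruction prefixes.
# The strings follow the E5 and BGE model cards; the helper is illustrative,
# not part of Haystack.
PREFIXES = {
    "e5": {"query": "query: ", "document": "passage: "},
    "bge": {
        "query": "Represent this sentence for searching relevant passages: ",
        "document": "",  # BGE applies an instruction only on the query side
    },
}

def apply_prefix(text: str, model_family: str, role: str) -> str:
    """Prepend the instruction prefix a model family expects for a given role."""
    return PREFIXES.get(model_family, {}).get(role, "") + text

print(apply_prefix("what is haystack?", "e5", "query"))
# query: what is haystack?
```

In Haystack terms, the `"query"` prefix would go on the text embedder and the `"document"` prefix on the document embedder, so both sides of retrieval match the pattern the model saw during training.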
## The Insight (Rule of Thumb)
- Batch Size: Default `batch_size=32` works well for most GPUs. Reduce to 8-16 for consumer GPUs with < 8GB VRAM; increase to 64-128 for A100/H100.
- Prefix for E5 models: Set `prefix="query: "` on text embedder and `prefix="passage: "` on document embedder.
- Prefix for BGE models: Set `prefix="Represent this sentence for searching relevant passages: "` on text embedder.
- Normalization: Set `normalize_embeddings=True` only when using cosine similarity; leave it `False` for dot-product scoring.
- Precision: Use `precision="float32"` (default) for best accuracy; use `"int8"` or `"binary"` for faster search at reduced quality.
- Truncation: Only use `truncate_dim` with Matryoshka-trained models; arbitrary truncation degrades quality.
- Trade-off: Larger batches improve throughput but increase VRAM usage linearly.
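The normalization rule of thumb can be verified directly: after L2 normalization, the dot product of two vectors equals the cosine similarity of the originals, which is why `normalize_embeddings=True` only matters for cosine-scored retrieval. A minimal pure-Python check:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (what normalize_embeddings=True does)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# On unit vectors, dot product equals cosine similarity of the originals.
assert abs(dot(l2_normalize(a), l2_normalize(b)) - cosine(a, b)) < 1e-12
```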
## Reasoning
Batch size 32 was chosen empirically as a sweet spot: small enough to fit on a 6-8GB consumer GPU with a typical sentence-transformers model (~400MB), large enough to benefit from GPU parallelism. The prefix/suffix pattern is not optional for instruction-tuned models; these models were trained to distinguish between query and document embeddings using instruction prefixes, and omitting them can reduce nDCG@10 by 10-20%.
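The throughput/VRAM trade-off follows from how the embedder chunks its input: texts are encoded `batch_size` at a time, so peak activation memory scales with the batch while larger chunks amortize GPU launch overhead. A rough sketch of the chunking loop (illustrative, not Haystack's actual implementation):

```python
from typing import Iterator

def batched(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield successive batch_size-sized chunks, as an embedder's encode loop would."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

docs = [f"doc {i}" for i in range(100)]
sizes = [len(chunk) for chunk in batched(docs, batch_size=32)]
print(sizes)  # [32, 32, 32, 4]
```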
Code evidence from `haystack/components/embedders/sentence_transformers_text_embedder.py:38-57`:
```python
def __init__(
    self,
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    ...
    precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
    ...
    backend: Literal["torch", "onnx", "openvino"] = "torch",
):
```
Truncation warning from `sentence_transformers_text_embedder.py:87-90`:
```text
:param truncate_dim:
    The dimension to truncate sentence embeddings to. `None` does no truncation.
    If the model has not been trained with Matryoshka Representation Learning,
    truncation of embeddings can significantly affect performance.
```
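Dimension truncation itself is just a slice of the embedding vector (typically followed by re-normalization when cosine scoring is used); the quality question is whether the model was trained, Matryoshka-style, to pack the most useful information into the leading dimensions. An illustrative sketch, not the library's internals:

```python
import math

def truncate_embedding(vec: list[float], truncate_dim: int) -> list[float]:
    """Keep the leading truncate_dim components and re-normalize.

    Reasonable for Matryoshka-trained models; lossy for arbitrary models.
    """
    head = vec[:truncate_dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.6, 0.8, 0.0, 0.0]
print(truncate_embedding(vec, 2))  # ~[0.6, 0.8]: the head was already unit length
```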