# Heuristic: deepset-ai Haystack Embedding Batch Size and Prefix
| Knowledge Sources | |
|---|---|
| Domains | Embeddings, Optimization |
| Last Updated | 2026-02-11 20:00 GMT |
## Overview
The default embedding batch size of 32 balances throughput and memory; use prefix/suffix strings to match the instruction patterns of models such as E5 and BGE for optimal retrieval quality.
## Description
Sentence Transformers embedders in Haystack use a default batch size of 32, empirically tuned for typical GPU configurations. The `prefix` and `suffix` parameters prepend or append instruction strings required by certain embedding models. For example, E5 models expect queries prefixed with `"query: "` and documents with `"passage: "`, while BGE models expect queries prefixed with `"Represent this sentence for searching relevant passages: "`. Using the correct prefix/suffix is critical for models trained with instruction-aware objectives; omitting them can significantly degrade retrieval quality.
## Usage
Apply this heuristic when configuring embedding components for production pipelines, switching embedding models (especially between instruction-aware models such as E5 or BGE and non-instruction models such as `all-mpnet-base-v2`), or tuning throughput vs. memory usage on specific hardware.
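The model-specific prefixes above can be captured in a small lookup table. This is a hypothetical helper for illustration, not a Haystack API; the prefix strings themselves come from the E5 and BGE model usage instructions:

```python
# Hypothetical helper mapping model families to their instruction prefixes.
# The strings follow the E5 and BGE model cards; the helper is illustrative,
# not part of Haystack.
PREFIXES = {
    "e5": {"query": "query: ", "document": "passage: "},
    "bge": {
        "query": "Represent this sentence for searching relevant passages: ",
        "document": "",  # BGE applies an instruction only on the query side
    },
}

def apply_prefix(text: str, model_family: str, role: str) -> str:
    """Prepend the instruction prefix a model family expects for a given role."""
    return PREFIXES.get(model_family, {}).get(role, "") + text

print(apply_prefix("what is haystack?", "e5", "query"))
# query: what is haystack?
```

In Haystack terms, the `"query"` prefix would go on the text embedder and the `"document"` prefix on the document embedder, so both sides of retrieval match the pattern the model saw during training.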
## The Insight (Rule of Thumb)
- Batch Size: Default `batch_size=32` works well for most GPUs. Reduce to 8-16 for consumer GPUs with < 8GB VRAM; increase to 64-128 for A100/H100.
- Prefix for E5 models: Set `prefix="query: "` on text embedder and `prefix="passage: "` on document embedder.
- Prefix for BGE models: Set `prefix="Represent this sentence for searching relevant passages: "` on text embedder.
- Normalization: Set `normalize_embeddings=True` only when using cosine similarity; leave it `False` for dot-product scoring.
- Precision: Use `precision="float32"` (default) for best accuracy; use `"int8"` or `"binary"` for faster search at reduced quality.
- Truncation: Only use `truncate_dim` with Matryoshka-trained models; arbitrary truncation degrades quality.
- Trade-off: Larger batches improve throughput but increase VRAM usage linearly.
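The normalization rule of thumb can be verified directly: after L2 normalization, the dot product of two vectors equals the cosine similarity of the originals, which is why `normalize_embeddings=True` only matters for cosine-scored retrieval. A minimal pure-Python check:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (what normalize_embeddings=True does)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [1.0, 2.0]
# On unit vectors, dot product equals cosine similarity of the originals.
assert abs(dot(l2_normalize(a), l2_normalize(b)) - cosine(a, b)) < 1e-12
```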
## Reasoning
Batch size 32 was chosen empirically as a sweet spot: small enough to fit on a 6-8GB consumer GPU with a typical sentence-transformers model (~400MB), large enough to benefit from GPU parallelism. The prefix/suffix pattern is not optional for instruction-tuned models; these models were trained to distinguish between query and document embeddings using instruction prefixes, and omitting them can reduce nDCG@10 by 10-20%.
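The throughput/VRAM trade-off follows from how the embedder chunks its input: texts are encoded `batch_size` at a time, so peak activation memory scales with the batch while larger chunks amortize GPU launch overhead. A rough sketch of the chunking loop (illustrative, not Haystack's actual implementation):

```python
from typing import Iterator

def batched(texts: list[str], batch_size: int = 32) -> Iterator[list[str]]:
    """Yield successive batch_size-sized chunks, as an embedder's encode loop would."""
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

docs = [f"doc {i}" for i in range(100)]
sizes = [len(chunk) for chunk in batched(docs, batch_size=32)]
print(sizes)  # [32, 32, 32, 4]
```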
Code evidence from `haystack/components/embedders/sentence_transformers_text_embedder.py:38-57`:
```python
def __init__(
    self,
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    ...
    precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
    ...
    backend: Literal["torch", "onnx", "openvino"] = "torch",
):
```
Truncation warning from `sentence_transformers_text_embedder.py:87-90`:
```text
:param truncate_dim:
    The dimension to truncate sentence embeddings to. `None` does no truncation.
    If the model has not been trained with Matryoshka Representation Learning,
    truncation of embeddings can significantly affect performance.
```
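Dimension truncation itself is just a slice of the embedding vector (typically followed by re-normalization when cosine scoring is used); the quality question is whether the model was trained, Matryoshka-style, to pack the most useful information into the leading dimensions. An illustrative sketch, not the library's internals:

```python
import math

def truncate_embedding(vec: list[float], truncate_dim: int) -> list[float]:
    """Keep the leading truncate_dim components and re-normalize.

    Reasonable for Matryoshka-trained models; lossy for arbitrary models.
    """
    head = vec[:truncate_dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

vec = [0.6, 0.8, 0.0, 0.0]
print(truncate_embedding(vec, 2))  # ~[0.6, 0.8]: the head was already unit length
```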