
Heuristic:Deepset ai Haystack BM25 Score Scaling

From Leeroopedia
Knowledge Sources
Domains Retrieval, Optimization
Last Updated 2026-02-11 20:00 GMT

Overview

Raw BM25 scores are unbounded; Haystack normalizes them to the 0-1 range using the expit (sigmoid) function with an empirically chosen scaling factor of 8. Increase this factor if most raw scores exceed 30.

Description

BM25 retrieval produces unbounded relevance scores that vary widely across corpora. To make scores comparable and interpretable, Haystack applies a sigmoid-based normalization using the expit function (inverse logit). A scaling factor controls how aggressively scores are compressed: `BM25_SCALING_FACTOR = 8` (default) maps a raw score of 10 to ~0.78. A separate `DOT_PRODUCT_SCALING_FACTOR = 100` is used for embedding similarity scores, which operate on a different scale.
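The mapping can be sketched in a few lines; the function name `scale_bm25` is ours, not Haystack's, but the formula mirrors the expit-based normalization described above:

```python
import math

def scale_bm25(raw_score: float, factor: float = 8.0) -> float:
    """Map an unbounded BM25 score into (0, 1) via expit(raw_score / factor)."""
    return 1.0 / (1.0 + math.exp(-raw_score / factor))

# With the default factor of 8, a raw score of 10 lands near 0.78.
print(round(scale_bm25(10.0), 2))
```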

Usage

Apply this heuristic when tuning retrieval scores for hybrid search (combining BM25 with embedding retrieval), when setting score thresholds for filtering, or when debugging why all BM25 scores cluster near 1.0. If your corpus produces consistently high BM25 scores (raw > 30), increase the scaling factor to spread scores across the 0-1 range.
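To see why raising the factor helps high-scoring corpora, compare how raw scores in the 30-50 range spread out under factor 8 versus factor 16 (the helper below is illustrative, not part of Haystack's API):

```python
import math

def expit_scale(raw: float, factor: float) -> float:
    """expit(raw / factor): larger factors compress less aggressively."""
    return 1.0 / (1.0 + math.exp(-raw / factor))

# With factor 8 these all crowd near 1.0; factor 16 keeps them separable.
for raw in (30.0, 40.0, 50.0):
    print(raw, round(expit_scale(raw, 8), 2), round(expit_scale(raw, 16), 2))
```

With factor 8 the three scores are nearly indistinguishable near 1.0, while factor 16 preserves a usable gap between them for thresholding.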

The Insight (Rule of Thumb)

  • Action: Set `scale_score=True` (default) on BM25 retrievers for normalized 0-1 scores.
  • Value: Default `BM25_SCALING_FACTOR = 8`. Increase to 12-16 for corpora with high raw scores (>30).
  • Trade-off: Higher scaling factors compress high scores more, potentially making it harder to distinguish highly relevant documents.
  • Dot Product: Uses a separate factor of 100, reflecting the different magnitude of embedding similarities.

Reasoning

Raw BM25 scores depend on corpus statistics (document frequency, average document length) and can range from 0 to 50+. Without normalization, combining BM25 and embedding scores in hybrid search would be meaningless since they operate on different scales. The expit function `1 / (1 + exp(-x/factor))` was chosen because it smoothly maps any real number to (0, 1) and is monotonically increasing.

Empirical examples with different factors:

  • Raw score 10, factor 2: scaled to 0.99 (too compressed)
  • Raw score 10, factor 8: scaled to 0.78 (good discrimination)
  • Raw score 30, factor 8: scaled to 0.98 (still high; increase factor if most scores are here)
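The three examples above can be reproduced directly from the expit formula:

```python
from math import exp

def scale(raw: float, factor: float) -> float:
    # expit(raw / factor), i.e. 1 / (1 + exp(-raw / factor))
    return 1.0 / (1.0 + exp(-raw / factor))

assert round(scale(10, 2), 2) == 0.99  # too compressed
assert round(scale(10, 8), 2) == 0.78  # good discrimination
assert round(scale(30, 8), 2) == 0.98  # still high
```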

Code evidence from `haystack/document_stores/in_memory/document_store.py:27-35`:

# document scores are essentially unbounded and will be scaled to values between 0 and 1 if scale_score is set to
# True (default). Scaling uses the expit function (inverse of the logit function) after applying a scaling factor
# (e.g., BM25_SCALING_FACTOR for the bm25_retrieval method).
# Larger scaling factor decreases scaled scores. For example, an input of 10 is scaled to 0.99 with
# BM25_SCALING_FACTOR=2 but to 0.78 with BM25_SCALING_FACTOR=8 (default). The defaults were chosen empirically.
# Increase the default if most unscaled scores are larger than expected (>30) and otherwise would incorrectly all be
# mapped to scores ~1.
BM25_SCALING_FACTOR = 8
DOT_PRODUCT_SCALING_FACTOR = 100
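Because each retriever's scores are squashed into (0, 1) with its own factor, they can be combined on one scale. The fusion below is a sketch with hypothetical raw scores and an equal weighting of our choosing, not Haystack's hybrid-search implementation:

```python
import math

BM25_SCALING_FACTOR = 8          # default from the source above
DOT_PRODUCT_SCALING_FACTOR = 100 # embedding similarities are much larger in magnitude

def expit(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

bm25_raw, dot_raw = 12.0, 120.0  # hypothetical raw scores for one document
bm25_scaled = expit(bm25_raw / BM25_SCALING_FACTOR)
dot_scaled = expit(dot_raw / DOT_PRODUCT_SCALING_FACTOR)

# Equal-weight fusion; real pipelines often tune this weight.
hybrid = 0.5 * bm25_scaled + 0.5 * dot_scaled
print(round(hybrid, 3))
```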

BM25 algorithm selection from `haystack/document_stores/in_memory/document_store.py:67-72`:

bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",  # Only match 2+ letter words
bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
bm25_parameters: dict | None = None,
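The default tokenization regex can be exercised on its own; note that it silently drops single-character tokens such as "a" or a lone digit:

```python
import re

# Same pattern as the bm25_tokenization_regex default above.
BM25_TOKENIZATION_REGEX = r"(?u)\b\w\w+\b"

tokens = re.findall(BM25_TOKENIZATION_REGEX, "A BM25 score of 8 is ok")
print(tokens)  # single-character tokens "A" and "8" are dropped
```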
