
Principle:Deepset ai Haystack BM25 Keyword Retrieval

From Leeroopedia

Metadata

Field                   Value
Principle Name          BM25 Keyword Retrieval
Domains                 Information_Retrieval, NLP
Related Implementation  Deepset_ai_Haystack_InMemoryBM25Retriever
Source Reference        haystack/components/retrievers/in_memory/bm25_retriever.py:L13-196
Repository              Deepset_ai_Haystack

Overview

BM25 (Best Matching 25) is a probabilistic keyword-based retrieval model that ranks documents by their relevance to a query using term frequency, inverse document frequency, and document length normalization. It extends the classical TF-IDF scoring with term saturation and configurable normalization parameters, making it one of the most effective and widely used sparse retrieval methods.

Description

BM25 keyword retrieval operates on the principle that documents containing the query terms are likely relevant, refining simple term counting with frequency saturation, term-rarity weighting, and length normalization. It is a sparse retrieval method: it relies on exact or near-exact term matching rather than learned semantic representations.

In the Haystack framework, BM25 retrieval is provided as a pipeline component that queries an InMemoryDocumentStore. The document store maintains an internal BM25 index (powered by the rank_bm25 library) that is built when documents are written to the store. At query time, the retriever tokenizes the query and scores each document against it using the BM25 formula.

Key characteristics of BM25 retrieval:

  • No embedding required: Documents and queries are compared based on their raw text content. There is no need for a neural model or GPU.
  • Fast and lightweight: BM25 scoring is computationally inexpensive, making it suitable for large document collections and low-latency applications.
  • Exact match bias: BM25 excels when the query and relevant documents share vocabulary, but struggles with vocabulary mismatch (synonyms, paraphrases).
  • Configurable scoring: The BM25 parameters k1 and b control term saturation and length normalization behavior.

Theoretical Basis

The BM25 Formula

For a query Q containing terms q1, q2, ..., qn, the BM25 score of a document D is:

score(D, Q) = sum_i IDF(qi) * (f(qi, D) * (k1 + 1)) / (f(qi, D) + k1 * (1 - b + b * |D| / avgdl))

Where:

  • f(qi, D) is the frequency of term qi in document D.
  • |D| is the length (number of tokens) of document D.
  • avgdl is the average document length across the entire corpus.
  • k1 controls term frequency saturation (typically 1.2 to 2.0). Higher values give more weight to repeated terms.
  • b controls document length normalization (typically 0.75). A value of 0 disables length normalization; a value of 1 fully normalizes by document length.
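As a concrete illustration, the per-term contribution above can be transcribed directly into a small function (a hedged sketch; the function name is ours, and the defaults follow the typical ranges just listed):

```python
def bm25_term_weight(tf, idf, doc_len, avgdl, k1=1.5, b=0.75):
    """One term's contribution to score(D, Q), transcribed from the formula above.

    tf      -- f(qi, D), frequency of the term in the document
    idf     -- IDF(qi), computed over the corpus
    doc_len -- |D|, number of tokens in the document
    avgdl   -- average document length across the corpus
    """
    length_norm = 1 - b + b * (doc_len / avgdl)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

When doc_len equals avgdl the normalization factor is 1, so under the defaults a single occurrence of a term with IDF 1.0 contributes exactly 1.0.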

Inverse Document Frequency (IDF)

The IDF component measures how informative a term is across the corpus:

IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)

Where:

  • N is the total number of documents in the corpus.
  • n(qi) is the number of documents containing term qi.

Terms that appear in many documents (common words) receive low IDF scores, while rare terms receive high IDF scores.
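This can be checked directly. The sketch below (whitespace tokenization and an illustrative corpus of our own) computes the IDF formula above; note that the +1 inside the logarithm keeps the score non-negative even for very common terms:

```python
import math

def bm25_idf(term, tokenized_docs):
    """IDF(qi) = ln((N - n(qi) + 0.5) / (n(qi) + 0.5) + 1)."""
    N = len(tokenized_docs)
    n = sum(1 for doc in tokenized_docs if term in doc)  # n(qi)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

docs = [doc.split() for doc in [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a treatise on quantum retrieval",
]]
# "the" appears in most documents, so its IDF is low;
# "quantum" appears in only one document, so its IDF is high.
```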

Term Saturation

Unlike raw TF-IDF, BM25 implements term saturation: the contribution of term frequency to the score increases sub-linearly and eventually plateaus. This means that a term appearing 10 times in a document does not score 10x higher than a term appearing once. The saturation behavior is controlled by the k1 parameter.
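Numerically, with b set to 0 so that only term frequency matters, a quick sketch shows the plateau:

```python
def tf_factor(tf, k1=1.5):
    # BM25 term-frequency factor with length normalization disabled (b = 0)
    return tf * (k1 + 1) / (tf + k1)

# tf_factor(1) is 1.0, while tf_factor(10) is roughly 2.17 -- about 2x,
# not 10x, the single-occurrence score. As tf grows, the factor
# approaches the ceiling k1 + 1 = 2.5.
```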

Document Length Normalization

BM25 normalizes for document length to avoid bias toward longer documents, which naturally contain more term occurrences. The b parameter controls how aggressively this normalization is applied. When b = 0.75 (the default), a document that is twice the average length has its term frequencies penalized, ensuring that a short document with high term density can score higher than a long document with the same raw term count.
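Putting IDF, saturation, and length normalization together, a minimal pure-Python scorer (whitespace tokenization only; the corpus and query are illustrative, not from Haystack) might look like:

```python
import math

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score each document against the query using the BM25 formula above."""
    tokenized = [doc.lower().split() for doc in documents]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    scores = []
    for d in tokenized:
        score = 0.0
        for term in query.lower().split():
            n = sum(1 for other in tokenized if term in other)  # n(qi)
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1)       # IDF(qi)
            tf = d.count(term)                                  # f(qi, D)
            score += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(score)
    return scores

corpus = [
    "sparse retrieval with bm25",
    "this is a much longer document that talks about sparse retrieval "
    "with bm25 among many other unrelated things",
    "completely unrelated text",
]
scores = bm25_scores("bm25 retrieval", corpus)
# The short, dense document outranks the long one despite identical raw
# term counts, and the document without the query terms scores 0.
```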

Usage

BM25 retrieval is used in query pipelines for keyword-based document retrieval. A typical BM25 pipeline consists of:

  1. An InMemoryBM25Retriever that scores documents against the query.
  2. Optionally, a ranker (such as a cross-encoder) to rerank the top results for higher precision.

BM25 is also frequently combined with embedding-based retrieval in hybrid retrieval architectures, where BM25 results and embedding results are merged and reranked.

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Writing documents to the store builds its internal BM25 index.
document_store = InMemoryDocumentStore()
document_store.write_documents([
    Document(content="Python is a popular programming language"),
    Document(content="Haystack is an open-source NLP framework"),
    Document(content="BM25 is a ranking function used in information retrieval"),
])

# The retriever returns documents ranked by BM25 score against the query.
retriever = InMemoryBM25Retriever(document_store=document_store)
result = retriever.run(query="information retrieval ranking")
for doc in result["documents"]:
    print(doc.content, doc.score)
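The k1 and b parameters described earlier are configured on the document store rather than on the retriever. Assuming the Haystack 2.x InMemoryDocumentStore API, which exposes bm25_algorithm and bm25_parameters arguments (verify against your installed version), a tuned setup might look like:

```python
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Assumed Haystack 2.x keyword arguments; check your installed version.
document_store = InMemoryDocumentStore(
    bm25_algorithm="BM25Okapi",              # classic Okapi BM25 variant
    bm25_parameters={"k1": 1.5, "b": 0.75},  # saturation / length normalization
)
retriever = InMemoryBM25Retriever(document_store=document_store, top_k=5)
```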

Related Pages

Implemented By

Uses Heuristic
