Implementation:Deepset ai Haystack InMemoryBM25Retriever

Metadata

Field	Value
Implementation Name	InMemoryBM25Retriever
Implementing Principle	Deepset_ai_Haystack_BM25_Keyword_Retrieval
Class	`InMemoryBM25Retriever`
Module	`haystack.components.retrievers.in_memory.bm25_retriever`
Source Reference	`haystack/components/retrievers/in_memory/bm25_retriever.py:L13-196`
Repository	Deepset_ai_Haystack
Dependencies	rank_bm25 (via InMemoryDocumentStore)

Overview

InMemoryBM25Retriever is a Haystack component that retrieves documents from an InMemoryDocumentStore using BM25 keyword-based scoring. It ranks documents by their lexical relevance to a text query, leveraging the BM25 algorithm implemented within the document store via the rank_bm25 library. No neural model or GPU is required.
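
To make the scoring idea concrete, here is a minimal, self-contained sketch of Okapi BM25 in plain Python. It is not Haystack's implementation: the tokenizer (lowercased whitespace split), the `k1` and `b` values, and the function name `bm25_scores` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a textbook Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0

    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Smoothed IDF (the "+ 1" inside the log keeps it non-negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", docs)
```

Only documents that share at least one term with the query receive a non-zero score, which is why purely lexical retrieval needs no model weights or GPU.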

Description

This retriever delegates the actual BM25 scoring to the InMemoryDocumentStore.bm25_retrieval() method. The component's role is to serve as a pipeline-compatible interface that manages configuration, filter policies, and parameter overrides at runtime.

Key behaviors:

- Document store validation: The constructor enforces that the provided document store is an instance of InMemoryDocumentStore, raising a ValueError otherwise.
- Filter policy: Supports two filter policies via FilterPolicy:
  - REPLACE (default): Runtime filters completely override initialization filters.
  - MERGE: Runtime filters are merged with initialization filters to narrow the search.
- Score scaling: When scale_score=True, raw BM25 scores are normalized to a 0-1 range where 1 indicates maximum relevance.
- Top-k validation: The top_k parameter must be greater than 0; a ValueError is raised otherwise.
- Async support: The component provides a run_async() method that calls the document store's asynchronous BM25 retrieval interface.
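
The two filter policies can be sketched in plain Python as follows. This is a simplified illustration rather than Haystack's code: the `FilterPolicy` enum and the `resolve_filters` helper are stand-ins defined locally, the `meta.year` field is a made-up example, and combining filters under a top-level `AND` is an assumption about how merging narrows the search.

```python
from enum import Enum
from typing import Any, Optional


class FilterPolicy(Enum):
    """Local stand-in for Haystack's FilterPolicy, for illustration only."""
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: FilterPolicy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Hypothetical helper: pick the filters that a retrieval call would use."""
    if runtime_filters is None:
        # No runtime filters supplied, so the init filters apply unchanged.
        return init_filters
    if policy is FilterPolicy.REPLACE or init_filters is None:
        # REPLACE: runtime filters completely override the init filters.
        return runtime_filters
    # MERGE (assumed behavior): require both conditions to hold.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.language", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2020}  # hypothetical field

replaced = resolve_filters(FilterPolicy.REPLACE, init, runtime)
merged = resolve_filters(FilterPolicy.MERGE, init, runtime)
```

Under REPLACE the language constraint is dropped as soon as any runtime filter is passed, so MERGE is the safer choice when the initialization filters encode a constraint that must always hold.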

Code Reference

Import

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

Constructor Signature

InMemoryBM25Retriever(
    document_store: InMemoryDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    filter_policy: FilterPolicy = FilterPolicy.REPLACE,
)

Parameter	Type	Default	Description
`document_store`	`InMemoryDocumentStore`	required	The document store to retrieve from. Must be an InMemoryDocumentStore instance.
`filters`	`dict[str, Any] | None`	`None`	Default filters to narrow the search space.
`top_k`	`int`	`10`	Maximum number of documents to return.
`scale_score`	`bool`	`False`	When True, scales scores to a 0-1 range.
`filter_policy`	`FilterPolicy`	`FilterPolicy.REPLACE`	How runtime filters interact with init filters (REPLACE or MERGE).

I/O Contract

Input

Parameter	Type	Required	Description
`query`	`str`	Yes	The query string to match against documents.
`filters`	`dict[str, Any] | None`	No	Runtime filters to apply. Behavior depends on `filter_policy`.
`top_k`	`int | None`	No	Override the default maximum number of documents to return.
`scale_score`	`bool | None`	No	Override the default score scaling behavior.
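
A minimal sketch of how the optional inputs above might fall back to the constructor defaults when they are left as None. The class `RetrieverDefaults` and its `effective` method are hypothetical names used only for illustration; they are not part of Haystack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrieverDefaults:
    """Hypothetical container for the values fixed at construction time."""
    top_k: int = 10
    scale_score: bool = False

    def __post_init__(self) -> None:
        # Mirrors the documented rule that top_k must be greater than 0.
        if self.top_k <= 0:
            raise ValueError(f"top_k must be greater than 0, got {self.top_k}")

    def effective(
        self, top_k: Optional[int] = None, scale_score: Optional[bool] = None
    ) -> tuple[int, bool]:
        """Return the values one call would use: the runtime value if given, else the default."""
        return (
            self.top_k if top_k is None else top_k,
            self.scale_score if scale_score is None else scale_score,
        )


defaults = RetrieverDefaults(top_k=5, scale_score=True)
no_override = defaults.effective()
with_override = defaults.effective(top_k=2, scale_score=False)
```

The check is against None rather than truthiness, so an explicit `scale_score=False` at runtime is honored even when the default is True.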

Output

Key	Type	Description
`documents`	`list[Document]`	Documents ranked by BM25 relevance, sorted from most to least relevant.

The output dictionary has the structure:

{"documents": list[Document]}

Each returned Document has its score field populated with the BM25 relevance score (raw or scaled depending on configuration).
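
One way to map unbounded BM25 scores into the 0-1 range is a logistic squash. The sketch below assumes that approach; the function name `scale_bm25_score` and the scaling constant are illustrative assumptions, not values confirmed by this page.

```python
import math


def scale_bm25_score(raw_score: float, scaling_factor: float = 8.0) -> float:
    """Squash a raw score into (0, 1) with a logistic function (assumed approach)."""
    # Dividing first flattens the curve so typical scores do not saturate near 1.
    return 1.0 / (1.0 + math.exp(-raw_score / scaling_factor))


raw_scores = [0.0, 2.5, 10.0]
scaled = [scale_bm25_score(s) for s in raw_scores]
```

A squash like this preserves the ranking order because it is strictly increasing. Scaled scores are easier to threshold or to combine with scores from other retrievers, but they are not probabilities.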

Usage Examples

Basic Keyword Retrieval

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

# Index the documents in an in-memory store.
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

# Only the German document contains the query term.
result = retriever.run(query="Programmiersprache")
print(result["documents"])

Retrieval with Filters

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is great for data science", meta={"language": "en"}),
    Document(content="Python eignet sich hervorragend fuer Data Science", meta={"language": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)

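# Defaults set at construction; scale_score=True maps scores into the 0-1 range.
retriever = InMemoryBM25Retriever(doc_store, top_k=5, scale_score=True)

# The runtime filter restricts retrieval to documents whose meta.language is "en".
result = retriever.run(
    query="data science",
    filters={"field": "meta.language", "operator": "==", "value": "en"},
)
for doc in result["documents"]:
    print(f"{doc.content} (score: {doc.score:.4f})")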

In a Query Pipeline

from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_store = InMemoryDocumentStore()
doc_store.write_documents([
    Document(content="Machine learning is a subset of artificial intelligence"),
    Document(content="Deep learning uses neural networks with many layers"),
    Document(content="Natural language processing deals with text data"),
])

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=doc_store, top_k=3))

result = pipeline.run({"retriever": {"query": "neural networks deep learning"}})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.score)

Related Pages

Implements Principle

Principle:Deepset_ai_Haystack_BM25_Keyword_Retrieval

Principle: Deepset_ai_Haystack_BM25_Keyword_Retrieval -- The principle that this component implements.
Related Implementation: Deepset_ai_Haystack_InMemoryEmbeddingRetriever -- The embedding-based retriever for the same document store.
Related Implementation: Deepset_ai_Haystack_TransformersSimilarityRanker -- Cross-encoder ranker often used to rerank BM25 results.

Requires Environment

Environment:Deepset_ai_Haystack_Python_Runtime_Environment

Uses Heuristic

Heuristic:Deepset_ai_Haystack_BM25_Score_Scaling
