Implementation:Deepset ai Haystack InMemoryBM25Retriever
Metadata
| Field | Value |
|---|---|
| Implementation Name | InMemoryBM25Retriever |
| Implementing Principle | Deepset_ai_Haystack_BM25_Keyword_Retrieval |
| Class | InMemoryBM25Retriever |
| Module | haystack.components.retrievers.in_memory.bm25_retriever |
| Source Reference | haystack/components/retrievers/in_memory/bm25_retriever.py:L13-196 |
| Repository | Deepset_ai_Haystack |
| Dependencies | rank_bm25 (via InMemoryDocumentStore) |
Overview
InMemoryBM25Retriever is a Haystack component that retrieves documents from an InMemoryDocumentStore using BM25 keyword-based scoring. It ranks documents by their lexical relevance to a text query, leveraging the BM25 algorithm implemented within the document store via the rank_bm25 library. No neural model or GPU is required.
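To make "lexical relevance" concrete, the following is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the code the document store runs: the function name `bm25_scores`, the whitespace tokenizer, and the `k1`/`b` defaults are assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in `corpus` against `query` with Okapi BM25."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency; the "1 +" keeps it positive.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


corpus = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", corpus)
```

Only the second document contains the query term, so it is the only one with a non-zero score. Matching is purely on tokens: a document that paraphrases the query without sharing a word with it scores zero.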
Description
This retriever delegates the actual BM25 scoring to the InMemoryDocumentStore.bm25_retrieval() method. The component's role is to serve as a pipeline-compatible interface that manages configuration, filter policies, and parameter overrides at runtime.
Key behaviors:
- Document store validation: The constructor enforces that the provided document store is an instance of InMemoryDocumentStore, raising a ValueError otherwise.
- Filter policy: Supports two filter policies via FilterPolicy:
  - REPLACE (default): Runtime filters completely override initialization filters.
  - MERGE: Runtime filters are merged with initialization filters to narrow the search.
- Score scaling: When scale_score=True, raw BM25 scores are normalized to a 0-1 range where 1 indicates maximum relevance.
- Top-k validation: The top_k parameter must be greater than 0; a ValueError is raised otherwise.
- Async support: The component provides a run_async() method that calls the document store's asynchronous BM25 retrieval interface.
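The difference between the two filter policies can be sketched in plain Python. This is a simplified illustration under stated assumptions, not Haystack's implementation: `Policy` is a hypothetical stand-in for FilterPolicy, `resolve_filters` is a hypothetical helper, and MERGE is modeled as a plain logical AND of both filters, whereas the library's merge rules may treat some cases differently (for example, conditions on the same field).

```python
from enum import Enum
from typing import Any, Optional


class Policy(Enum):
    """Hypothetical stand-in for Haystack's FilterPolicy."""
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: Policy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Pick the filters used for one query (simplified illustration)."""
    if runtime_filters is None:
        # Nothing passed at run time: fall back to the init filters.
        return init_filters
    if policy is Policy.REPLACE or init_filters is None:
        # REPLACE: the runtime filters win outright.
        return runtime_filters
    # MERGE (simplified): both conditions must hold.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.language", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2020}

replaced = resolve_filters(Policy.REPLACE, init, runtime)
merged = resolve_filters(Policy.MERGE, init, runtime)
```

Under REPLACE the language condition disappears once a runtime filter is supplied; under MERGE both conditions apply, so the result set can only shrink.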
Code Reference
Import
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
Constructor Signature
InMemoryBM25Retriever(
    document_store: InMemoryDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    filter_policy: FilterPolicy = FilterPolicy.REPLACE,
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| document_store | InMemoryDocumentStore | required | The document store to retrieve from. Must be an InMemoryDocumentStore instance. |
| filters | dict[str, Any] or None | None | Default filters to narrow the search space. |
| top_k | int | 10 | Maximum number of documents to return. |
| scale_score | bool | False | When True, scales scores to a 0-1 range. |
| filter_policy | FilterPolicy | FilterPolicy.REPLACE | How runtime filters interact with init filters (REPLACE or MERGE). |
I/O Contract
Input
| Parameter | Type | Required | Description |
|---|---|---|---|
| query | str | Yes | The query string to match against documents. |
| filters | dict[str, Any] or None | No | Runtime filters to apply. Behavior depends on filter_policy. |
| top_k | int or None | No | Override the default maximum number of documents to return. |
| scale_score | bool or None | No | Override the default score scaling behavior. |
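The optional inputs act as per-call overrides of the values given to the constructor. A minimal sketch of that fallback logic, assuming that None means "not provided" (the helper name `effective_settings` and its signature are hypothetical, not part of Haystack):

```python
from typing import Optional


def effective_settings(
    init_top_k: int,
    init_scale_score: bool,
    top_k: Optional[int] = None,
    scale_score: Optional[bool] = None,
) -> tuple[int, bool]:
    """Resolve per-call overrides against init-time defaults (hypothetical helper)."""
    # None means "not provided", so the init value applies.
    resolved_top_k = init_top_k if top_k is None else top_k
    resolved_scale = init_scale_score if scale_score is None else scale_score
    if resolved_top_k <= 0:
        raise ValueError(f"top_k must be greater than 0, got {resolved_top_k}")
    return resolved_top_k, resolved_scale


defaults = effective_settings(init_top_k=10, init_scale_score=True)
overridden = effective_settings(init_top_k=10, init_scale_score=True, top_k=3, scale_score=False)
```

The sketch compares against None instead of testing truthiness because False is a legitimate override for scale_score and must not be mistaken for "not provided".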
Output
| Key | Type | Description |
|---|---|---|
| documents | list[Document] | Documents ranked by BM25 relevance, sorted from most to least relevant. |
The output dictionary has the structure:
{"documents": list[Document]}
Each returned Document has its score field populated with the BM25 relevance score (raw or scaled depending on configuration).
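This page does not spell out how scaled scores are computed. One common way to squash an unbounded relevance score into the unit interval is a logistic function, sketched below. Both the function name `scale_bm25_score` and the divisor of 8.0 are assumptions made for illustration; consult the document store's source for the exact formula it applies.

```python
import math


def scale_bm25_score(raw_score: float, scaling_factor: float = 8.0) -> float:
    """Squash an unbounded score into (0, 1) with a logistic function.

    The scaling factor is an assumed constant for this sketch.
    """
    return 1.0 / (1.0 + math.exp(-raw_score / scaling_factor))


raw_scores = [0.0, 2.5, 12.0]
scaled = [scale_bm25_score(score) for score in raw_scores]
```

The mapping is monotonic, so scaling never changes the ranking order. With a logistic function, a raw score of 0 maps to 0.5, so non-negative raw scores land in the upper half of the range.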
Usage Examples
Basic Keyword Retrieval
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)
result = retriever.run(query="Programmiersprache")
print(result["documents"])
Retrieval with Filters
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
docs = [
    Document(content="Python is great for data science", meta={"language": "en"}),
    Document(content="Python eignet sich hervorragend fuer Data Science", meta={"language": "de"}),
]
doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store, top_k=5, scale_score=True)
result = retriever.run(
    query="data science",
    filters={"field": "meta.language", "operator": "==", "value": "en"},
)
for doc in result["documents"]:
    print(f"{doc.content} (score: {doc.score:.4f})")
In a Query Pipeline
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
doc_store = InMemoryDocumentStore()
doc_store.write_documents([
    Document(content="Machine learning is a subset of artificial intelligence"),
    Document(content="Deep learning uses neural networks with many layers"),
    Document(content="Natural language processing deals with text data"),
])
pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=doc_store, top_k=3))
result = pipeline.run({"retriever": {"query": "neural networks deep learning"}})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.score)
Related Pages
Implements Principle
- Principle: Deepset_ai_Haystack_BM25_Keyword_Retrieval -- The principle that this component implements.
- Related Implementation: Deepset_ai_Haystack_InMemoryEmbeddingRetriever -- The embedding-based retriever for the same document store.
- Related Implementation: Deepset_ai_Haystack_TransformersSimilarityRanker -- Cross-encoder ranker often used to rerank BM25 results.