Implementation:Deepset ai Haystack InMemoryBM25Retriever

From Leeroopedia

Metadata

Field                   Value
Implementation Name     InMemoryBM25Retriever
Implementing Principle  Deepset_ai_Haystack_BM25_Keyword_Retrieval
Class                   InMemoryBM25Retriever
Module                  haystack.components.retrievers.in_memory.bm25_retriever
Source Reference        haystack/components/retrievers/in_memory/bm25_retriever.py:L13-196
Repository              Deepset_ai_Haystack
Dependencies            rank_bm25 (via InMemoryDocumentStore)

Overview

InMemoryBM25Retriever is a Haystack component that retrieves documents from an InMemoryDocumentStore using BM25 keyword-based scoring. It ranks documents by their lexical relevance to a text query, leveraging the BM25 algorithm implemented within the document store via the rank_bm25 library. No neural model or GPU is required.
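
To make the idea of lexical relevance concrete, here is a minimal pure-Python sketch of Okapi-style BM25 scoring. It is not the document store's implementation: the function name `bm25_scores`, the whitespace tokenizer, the non-negative IDF variant, and the `k1` and `b` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with a basic Okapi-style BM25 formula.

    Hypothetical helper for illustration only; k1 and b are common textbook values.
    """
    if not docs:
        return []
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # IDF variant with "+ 1" inside the log so it never goes negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            denom = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


docs = [
    "Python is a popular programming language",
    "python ist eine beliebte Programmiersprache",
]
scores = bm25_scores("Programmiersprache", docs)
# Pair scores with documents and sort from most to least relevant.
ranked = sorted(zip(scores, docs), reverse=True)
```

Only the second document contains the query term, so it is the only one with a non-zero score and ranks first.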

Description

This retriever delegates the actual BM25 scoring to the InMemoryDocumentStore.bm25_retrieval() method. The component's role is to serve as a pipeline-compatible interface that manages configuration, filter policies, and parameter overrides at runtime.

Key behaviors:

  • Document store validation: The constructor enforces that the provided document store is an instance of InMemoryDocumentStore, raising a ValueError otherwise.
  • Filter policy: Supports two filter policies via FilterPolicy:
    • REPLACE (default): Runtime filters completely override initialization filters.
    • MERGE: Runtime filters are merged with initialization filters to narrow the search.
  • Score scaling: When scale_score=True, raw BM25 scores are normalized to a 0-1 range where 1 indicates maximum relevance.
  • Top-k validation: The top_k parameter must be greater than 0; a ValueError is raised otherwise.
  • Async support: The component provides a run_async() method that calls the document store's asynchronous BM25 retrieval interface.
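
The two filter policies can be sketched in plain Python as follows. The `Policy` enum and `resolve_filters` helper are hypothetical stand-ins, not Haystack's `FilterPolicy` or its internal logic. In particular, combining both filters under a single `AND` condition is an assumption about how MERGE narrows the search, and the library may handle more cases.

```python
from enum import Enum
from typing import Any, Optional


class Policy(Enum):
    """Hypothetical stand-in for FilterPolicy, for illustration only."""
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(
    policy: Policy,
    init_filters: Optional[dict[str, Any]],
    runtime_filters: Optional[dict[str, Any]],
) -> Optional[dict[str, Any]]:
    """Decide which filters reach the document store (illustrative sketch)."""
    # With no runtime filters, the initialization filters apply unchanged.
    if runtime_filters is None:
        return init_filters
    # REPLACE: runtime filters completely override initialization filters.
    if policy is Policy.REPLACE or init_filters is None:
        return runtime_filters
    # MERGE (assumed behavior): require both filters to match.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init = {"field": "meta.language", "operator": "==", "value": "en"}
runtime = {"field": "meta.year", "operator": ">=", "value": 2020}

replaced = resolve_filters(Policy.REPLACE, init, runtime)
merged = resolve_filters(Policy.MERGE, init, runtime)
```

Here `replaced` is just the runtime filter, while `merged` keeps the language restriction and adds the year restriction on top of it.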

Code Reference

Import

from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

Constructor Signature

InMemoryBM25Retriever(
    document_store: InMemoryDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    filter_policy: FilterPolicy = FilterPolicy.REPLACE,
)

Parameter       Type                    Default               Description
document_store  InMemoryDocumentStore   required              The document store to retrieve from. Must be an InMemoryDocumentStore instance.
filters         dict[str, Any] | None   None                  Default filters to narrow the search space.
top_k           int                     10                    Maximum number of documents to return. Must be greater than 0.
scale_score     bool                    False                 When True, scales scores to a 0-1 range.
filter_policy   FilterPolicy            FilterPolicy.REPLACE  How runtime filters interact with init filters (REPLACE or MERGE).

I/O Contract

Input

Parameter    Type                    Required  Description
query        str                     Yes       The query string to match against documents.
filters      dict[str, Any] | None   No        Runtime filters to apply. Behavior depends on filter_policy.
top_k        int | None              No        Overrides the default maximum number of documents to return.
scale_score  bool | None             No        Overrides the default score scaling behavior.
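
The optional inputs follow a "None means use the configured default" pattern. The sketch below illustrates that pattern and the delegation to the store. `StubStore` and `SketchRetriever` are hypothetical classes written for this example. The stub returns plain dictionaries rather than Document objects, and filter handling is left out for brevity.

```python
from typing import Any, Optional


class StubStore:
    """Hypothetical stand-in for a document store; it echoes the arguments it receives."""

    def bm25_retrieval(self, query: str, top_k: int = 10, scale_score: bool = False) -> list[dict[str, Any]]:
        return [{"query": query, "top_k": top_k, "scale_score": scale_score}]


class SketchRetriever:
    """Illustrative thin wrapper: validates config, resolves overrides, delegates."""

    def __init__(self, store: StubStore, top_k: int = 10, scale_score: bool = False):
        # Mirrors the documented rule that top_k must be greater than 0.
        if top_k <= 0:
            raise ValueError(f"top_k must be greater than 0, got {top_k}")
        self.store = store
        self.top_k = top_k
        self.scale_score = scale_score

    def run(
        self,
        query: str,
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None,
    ) -> dict[str, list[dict[str, Any]]]:
        # A runtime value of None falls back to the value set at construction.
        effective_top_k = self.top_k if top_k is None else top_k
        effective_scale = self.scale_score if scale_score is None else scale_score
        docs = self.store.bm25_retrieval(
            query=query, top_k=effective_top_k, scale_score=effective_scale
        )
        return {"documents": docs}


retriever = SketchRetriever(StubStore(), top_k=5)
default_call = retriever.run("bm25")                             # uses top_k=5, scale_score=False
override_call = retriever.run("bm25", top_k=2, scale_score=True)  # runtime values win
```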

Output

Key        Type            Description
documents  list[Document]  Documents ranked by BM25 relevance, sorted from most to least relevant.

The output dictionary has the structure:

{"documents": list[Document]}

Each returned Document has its score field populated with the BM25 relevance score (raw or scaled depending on configuration).
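
As a rough illustration of what scaling to a 0-1 range can look like, the sketch below squashes an unbounded raw score with a logistic function. The function name `scale_bm25_score` and the `scaling_factor` value are assumptions for illustration and may not match what InMemoryDocumentStore does internally.

```python
import math


def scale_bm25_score(raw_score: float, scaling_factor: float = 8.0) -> float:
    """Map an unbounded score into the open interval (0, 1) with a logistic function.

    Hypothetical helper; scaling_factor is an illustrative assumption.
    """
    x = raw_score / scaling_factor
    # Numerically stable logistic: avoid exponentiating large positive values.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return exp_x / (1.0 + exp_x)


raw_scores = [0.0, 1.2, 6.5, 20.0]
scaled = [scale_bm25_score(score) for score in raw_scores]
```

With this particular function, a raw score of 0 maps to 0.5 and larger raw scores approach 1, so the ordering of documents is the same before and after scaling.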

Usage Examples

Basic Keyword Retrieval

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is a popular programming language"),
    Document(content="python ist eine beliebte Programmiersprache"),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)
retriever = InMemoryBM25Retriever(doc_store)

result = retriever.run(query="Programmiersprache")
print(result["documents"])

Retrieval with Filters

from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

docs = [
    Document(content="Python is great for data science", meta={"language": "en"}),
    Document(content="Python eignet sich hervorragend fuer Data Science", meta={"language": "de"}),
]

doc_store = InMemoryDocumentStore()
doc_store.write_documents(docs)

retriever = InMemoryBM25Retriever(doc_store, top_k=5, scale_score=True)
result = retriever.run(
    query="data science",
    filters={"field": "meta.language", "operator": "==", "value": "en"},
)
for doc in result["documents"]:
    print(f"{doc.content} (score: {doc.score:.4f})")

In a Query Pipeline

from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_store = InMemoryDocumentStore()
doc_store.write_documents([
    Document(content="Machine learning is a subset of artificial intelligence"),
    Document(content="Deep learning uses neural networks with many layers"),
    Document(content="Natural language processing deals with text data"),
])

pipeline = Pipeline()
pipeline.add_component("retriever", InMemoryBM25Retriever(document_store=doc_store, top_k=3))

result = pipeline.run({"retriever": {"query": "neural networks deep learning"}})
for doc in result["retriever"]["documents"]:
    print(doc.content, doc.score)
