Implementation:FlagOpen FlagEmbedding LLM Embedder BM25

From Leeroopedia


Knowledge Sources
Domains: Information_Retrieval, BM25, Sparse_Retrieval
Last Updated: 2026-02-09 00:00 GMT

Overview

Two BM25 retrievers for traditional keyword-based search: one backed by Anserini (Lucene-based) and one implemented in pure Python that runs entirely in memory.

Description

This module provides two BM25 retrieval approaches:

BM25Retriever (Anserini-based):

  • Uses Lucene via Anserini for production-scale retrieval
  • index(): Converts corpus to JSON collection, builds Lucene index with configurable parameters (threads, language, storeDocvectors)
  • search(): Performs BM25 search with tunable k1 and b parameters, handles large query files by splitting into shards
  • Supports loading pre-built indices and collections for faster repeated evaluation
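
The collection-conversion and query-sharding steps listed above can be sketched as follows. This is a minimal illustration, not the module's code: the helper names `write_json_collection` and `shard_queries` and the file name `docs.jsonl` are hypothetical, and the sketch assumes the collection uses the `id` and `contents` fields read by Anserini's JsonCollection.

```python
import json
from pathlib import Path
from typing import Iterable, List


def write_json_collection(corpus: Iterable[str], collection_dir: str) -> Path:
    """Write one JSON object per line with "id" and "contents" fields
    (hypothetical helper; the file name docs.jsonl is an assumption)."""
    out_dir = Path(collection_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "docs.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for doc_id, text in enumerate(corpus):
            # Using the corpus position as the id lets search results map
            # straight back to passage indices.
            record = {"id": str(doc_id), "contents": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def shard_queries(queries: List[str], shard_size: int) -> List[List[str]]:
    """Split a query list into consecutive shards of at most shard_size items."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [queries[i:i + shard_size] for i in range(0, len(queries), shard_size)]
```

Each shard could then be searched separately and the results concatenated in shard order, which keeps the output aligned with the original query order.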

NaiveBM25Retriever (Pure Python):

  • Fully in-memory implementation for smaller corpora or self-retrieval scenarios
  • index(): Builds inverted index with document frequencies and term frequencies
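  • search(): Computes BM25 scores per query term as IDF * (k1+1) * tf / (tf + k1 * (1-b + b*dl)), summed over the query's terms, where tf is the term's frequency in the document and dl is the document length (in standard BM25, taken relative to the average document length in the corpus)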
  • Supports optional stop word filtering and processes queries/documents as either strings or pre-tokenized lists
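
To make the in-memory approach concrete, here is a minimal sketch of an inverted index with BM25 scoring. It is not the NaiveBM25Retriever source: the class name `TinyBM25` is hypothetical, and the lowercase whitespace tokenizer, the Lucene-style IDF, and the normalization of document length by the corpus average are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Set, Tuple


class TinyBM25:
    """Illustrative in-memory BM25 (hypothetical; not the FlagEmbedding class)."""

    def __init__(self, k1: float = 0.9, b: float = 0.4) -> None:
        self.k1 = k1
        self.b = b

    def _tokenize(self, text: str) -> List[str]:
        # Assumption: lowercase whitespace tokenization with stop word removal.
        return [t for t in text.lower().split() if t not in self.stop_tokens]

    def index(self, corpus: List[str], stop_tokens: Optional[Set[str]] = None) -> None:
        self.stop_tokens = stop_tokens or set()
        # Inverted index: term -> {document index: term frequency}.
        # Document frequency of a term is the size of its posting dict.
        self.postings: Dict[str, Dict[int, int]] = defaultdict(dict)
        self.doc_lens: List[int] = []
        for doc_idx, text in enumerate(corpus):
            tokens = self._tokenize(text)
            self.doc_lens.append(len(tokens))
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_idx] = tf
        self.num_docs = len(corpus)
        self.avg_len = sum(self.doc_lens) / max(self.num_docs, 1)

    def search(self, query: str, hits: int = 100) -> List[Tuple[int, float]]:
        scores: Dict[int, float] = defaultdict(float)
        # Each distinct query term contributes once.
        for term in set(self._tokenize(query)):
            posting = self.postings.get(term)
            if not posting:
                continue
            df = len(posting)
            # Assumption: Lucene-style IDF, which is never negative.
            idf = math.log(1 + (self.num_docs - df + 0.5) / (df + 0.5))
            for doc_idx, tf in posting.items():
                dl = self.doc_lens[doc_idx] / self.avg_len
                scores[doc_idx] += idf * (self.k1 + 1) * tf / (
                    tf + self.k1 * (1 - self.b + self.b * dl)
                )
        # Rank by descending score and keep the top `hits` documents.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:hits]
```

Only documents that share at least one term with the query receive a score, so the result list can be shorter than `hits`.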

Both implementations use standard BM25 parameters (k1=0.9, b=0.4 by default) and return ranked lists of passage indices with scores.

Usage

Use BM25Retriever for large-scale retrieval evaluation on standard benchmarks, and NaiveBM25Retriever for self-retrieval in long documents or when Anserini is unavailable.

Code Reference

Source Location

Signature

class BM25Retriever:
    def __init__(self, anserini_dir, k1=0.9, b=0.4, **kwds)
    def index(self, corpus, output_dir, threads=32, language="en",
              storeDocvectors=False, load_collection=False, load_index=False)
    def search(self, eval_data, output_dir, k1, b, hits=100, threads=32)

class NaiveBM25Retriever:
    def __init__(self, k1=0.9, b=0.4, **kwds)
    def index(self, corpus: List[str], verbose=False, stop_tokens=None)
    def search(self, queries: List[str], hits=100, k1=None, b=None, verbose=False)

Import

from research.llm_embedder.src.retrieval.modeling_bm25 import BM25Retriever, NaiveBM25Retriever

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| corpus | Dataset/List[str] | Yes | Documents to index |
| eval_data | Dataset/str | Yes | Queries for search |
| anserini_dir | str | Yes | Path to Anserini installation (BM25Retriever only) |
| k1 | float | No | BM25 k1 parameter (default: 0.9) |
| b | float | No | BM25 b parameter (default: 0.4) |
| hits | int | No | Number of results to return (default: 100) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| query_ids | List | Query identifiers |
| indices | List[List[int]] | Retrieved document indices per query |
| scores | np.ndarray | BM25 scores, shape (num_queries, hits) |
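
As one way to consume these outputs, the sketch below formats them as TREC run lines (`qid Q0 docid rank score tag`) for use with standard evaluation tools. The helper name `to_trec_run` and the run tag are hypothetical, and the sketch assumes the three outputs are aligned by query and that each query's indices are already sorted by descending score.

```python
from typing import List, Sequence


def to_trec_run(
    query_ids: Sequence,
    indices: Sequence[Sequence[int]],
    scores: Sequence[Sequence[float]],
    run_tag: str = "bm25",
) -> List[str]:
    """Format aligned retrieval outputs as TREC run lines (hypothetical helper)."""
    lines = []
    for qid, doc_indices, doc_scores in zip(query_ids, indices, scores):
        # Assumption: results arrive ranked, so position + 1 is the rank.
        for rank, (doc_idx, score) in enumerate(zip(doc_indices, doc_scores), start=1):
            lines.append(f"{qid} Q0 {doc_idx} {rank} {score:.4f} {run_tag}")
    return lines
```

The function only iterates over its arguments, so nested lists and NumPy arrays are both accepted.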

Usage Examples

import datasets
from research.llm_embedder.src.retrieval.modeling_bm25 import BM25Retriever, NaiveBM25Retriever

# Anserini-based BM25
retriever = BM25Retriever(
    anserini_dir="/path/to/anserini",
    k1=0.9,
    b=0.4
)

corpus = datasets.load_dataset("json", data_files="corpus.json", split="train")
retriever.index(corpus, output_dir="./bm25_index", threads=32, language="en")

queries = datasets.load_dataset("json", data_files="queries.json", split="train")
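# search() takes output_dir, k1, and b per the signature above.
# Assumption: output_dir is the same directory that was passed to index().
query_ids, indices = retriever.search(queries, output_dir="./bm25_index", k1=0.9, b=0.4, hits=100)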

# In-memory Python BM25
naive_retriever = NaiveBM25Retriever(k1=0.9, b=0.4)
corpus_texts = ["document one text", "document two text"]  # ... more documents
naive_retriever.index(corpus_texts, verbose=True)

query_texts = ["query one", "query two"]  # ... more queries
scores, indices = naive_retriever.search(query_texts, hits=10, verbose=True)
# [0][0] indexing works for both nested lists and NumPy arrays
print(f"Top result for query 0: doc {indices[0][0]} with score {scores[0][0]:.3f}")
