Implementation:FlagOpen FlagEmbedding LLM Embedder BM25
| Knowledge Sources | |
|---|---|
| Domains | Information_Retrieval, BM25, Sparse_Retrieval |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
BM25 retrieval for traditional keyword-based search, provided in two forms: an Anserini (Lucene-based) retriever and an in-memory pure-Python retriever.
Description
This module provides two BM25 retrieval approaches:
BM25Retriever (Anserini-based):
- Uses Lucene via Anserini for production-scale retrieval
- index(): Converts corpus to JSON collection, builds Lucene index with configurable parameters (threads, language, storeDocvectors)
- search(): Performs BM25 search with tunable k1 and b parameters, handles large query files by splitting into shards
- Supports loading pre-built indices and collections for faster repeated evaluation
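The two preparation steps described above (writing the corpus as a JSON collection and splitting queries into shards) can be sketched as follows. This is a minimal illustration, not the module's code: the helper names `write_json_collection` and `split_queries`, the file names, and the `shard_size` parameter are hypothetical. The `id`/`contents` fields follow Anserini's JSON collection format, and the tab-separated query layout is assumed to match a TSV topic reader.

```python
import json
from pathlib import Path
from typing import Iterable, List, Tuple


def write_json_collection(texts: Iterable[str], collection_dir: str) -> Path:
    """Write one JSON object per line with `id` and `contents` fields."""
    out_dir = Path(collection_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "docs.jsonl"  # hypothetical file name
    with out_path.open("w", encoding="utf-8") as f:
        for doc_id, text in enumerate(texts):
            record = {"id": str(doc_id), "contents": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path


def split_queries(
    queries: List[Tuple[str, str]], shard_size: int, output_dir: str
) -> List[Path]:
    """Write (query_id, text) pairs to TSV shards of at most `shard_size` rows."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths: List[Path] = []
    for shard_idx, start in enumerate(range(0, len(queries), shard_size)):
        shard_path = out_dir / f"queries.{shard_idx}.tsv"  # hypothetical naming
        with shard_path.open("w", encoding="utf-8") as f:
            for query_id, text in queries[start:start + shard_size]:
                # Collapse tabs and newlines so each query stays on one TSV row.
                clean = " ".join(text.split())
                f.write(f"{query_id}\t{clean}\n")
        shard_paths.append(shard_path)
    return shard_paths
```

Each shard can then be searched separately and the per-shard results concatenated in shard order.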
NaiveBM25Retriever (Pure Python):
- Fully in-memory implementation for smaller corpora or self-retrieval scenarios
- index(): Builds inverted index with document frequencies and term frequencies
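- search(): Computes BM25 scores using the formula: IDF * (k1+1) * tf / (tf + k1 * (1-b + b*dl)), where tf is the term's frequency in the document and dl is the document length (in standard BM25, the length relative to the average document length in the corpus)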
- Supports optional stop word filtering and processes queries/documents as either strings or pre-tokenized lists
Both implementations default to k1=0.9 and b=0.4 (the same defaults Anserini uses for BM25) and return ranked lists of passage indices with scores.
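To make the in-memory approach concrete, here is a minimal sketch of an inverted index plus BM25 scoring with the formula given above. It is not the code of NaiveBM25Retriever: the function names `build_index` and `bm25_search` are hypothetical, tokenization is assumed to be lowercase whitespace splitting, the IDF is assumed to be the Lucene-style `log(1 + (N - df + 0.5) / (df + 0.5))`, and document length is assumed to be normalized by the average length.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_index(corpus: List[str]) -> Tuple[Dict[str, Dict[int, int]], List[int], float]:
    """Return (inverted index, document lengths, average document length)."""
    inverted: Dict[str, Dict[int, int]] = defaultdict(dict)
    doc_lens: List[int] = []
    for doc_id, text in enumerate(corpus):
        tokens = text.lower().split()  # assumed tokenizer
        doc_lens.append(len(tokens))
        for token, tf in Counter(tokens).items():
            inverted[token][doc_id] = tf  # token -> {doc_id: term frequency}
    avg_len = sum(doc_lens) / max(len(doc_lens), 1)
    return inverted, doc_lens, avg_len


def bm25_search(
    query: str,
    inverted: Dict[str, Dict[int, int]],
    doc_lens: List[int],
    avg_len: float,
    k1: float = 0.9,
    b: float = 0.4,
    hits: int = 100,
) -> List[Tuple[int, float]]:
    """Return up to `hits` (doc_id, score) pairs, best first."""
    n_docs = len(doc_lens)
    scores: Dict[int, float] = defaultdict(float)
    for token in query.lower().split():
        postings = inverted.get(token)
        if not postings:
            continue  # token never seen at index time
        df = len(postings)  # document frequency
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            dl = doc_lens[doc_id] / avg_len  # normalized document length
            scores[doc_id] += idf * (k1 + 1) * tf / (tf + k1 * (1 - b + b * dl))
    # Sort by descending score; break ties by document index.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:hits]
```

Only documents that share at least one token with the query receive a score, which is why the inverted index keeps the search cheap on small corpora.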
Usage
Use BM25Retriever for large-scale retrieval evaluation on standard benchmarks, and NaiveBM25Retriever for self-retrieval in long documents or when Anserini is unavailable.
Code Reference
Source Location
- Repository: FlagOpen_FlagEmbedding
- File: research/llm_embedder/src/retrieval/modeling_bm25.py
- Lines: 1-243
Signature
class BM25Retriever:
    def __init__(self, anserini_dir, k1=0.9, b=0.4, **kwds)
    def index(self, corpus, output_dir, threads=32, language="en",
              storeDocvectors=False, load_collection=False, load_index=False)
    def search(self, eval_data, output_dir, k1, b, hits=100, threads=32)

class NaiveBM25Retriever:
    def __init__(self, k1=0.9, b=0.4, **kwds)
    def index(self, corpus: List[str], verbose=False, stop_tokens=None)
    def search(self, queries: List[str], hits=100, k1=None, b=None, verbose=False)
Import
from research.llm_embedder.src.retrieval.modeling_bm25 import BM25Retriever, NaiveBM25Retriever
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| corpus | Dataset/List[str] | Yes | Documents to index |
| eval_data | Dataset/str | Yes | Queries for search |
| anserini_dir | str | Yes | Path to Anserini installation (BM25Retriever only) |
| k1 | float | No | BM25 k1 parameter (default: 0.9) |
| b | float | No | BM25 b parameter (default: 0.4) |
| hits | int | No | Number of results to return (default: 100) |
Outputs
| Name | Type | Description |
|---|---|---|
| query_ids | List | Query identifiers |
| indices | List[List[int]] | Retrieved document indices per query |
| scores | np.ndarray | BM25 scores, shape (num_queries, hits) |
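As an illustration of how these outputs line up, the sketch below zips the three parallel structures into one ranked list per query. The helper name `to_ranked_lists` is hypothetical, and the filter on negative indices assumes that rows with fewer than `hits` results are padded with -1, which the source does not state.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def to_ranked_lists(
    query_ids: Sequence[Hashable],
    indices: Sequence[Sequence[int]],
    scores: Sequence[Sequence[float]],
) -> Dict[Hashable, List[Tuple[int, float]]]:
    """Map each query id to its [(doc_index, score), ...] list in rank order."""
    results: Dict[Hashable, List[Tuple[int, float]]] = {}
    for query_id, doc_indices, doc_scores in zip(query_ids, indices, scores):
        results[query_id] = [
            (int(doc_idx), float(score))
            for doc_idx, score in zip(doc_indices, doc_scores)
            if doc_idx >= 0  # assumption: -1 marks a padded (missing) result
        ]
    return results
```

Row-wise iteration works for nested lists and for 2-D NumPy arrays alike, so the helper does not depend on which container the retriever returns.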
Usage Examples
import datasets
from research.llm_embedder.src.retrieval.modeling_bm25 import BM25Retriever, NaiveBM25Retriever

# Anserini-based BM25
retriever = BM25Retriever(
    anserini_dir="/path/to/anserini",
    k1=0.9,
    b=0.4,
)
corpus = datasets.load_dataset("json", data_files="corpus.json", split="train")
retriever.index(corpus, output_dir="./bm25_index", threads=32, language="en")
queries = datasets.load_dataset("json", data_files="queries.json", split="train")
# search() also takes output_dir, k1 and b (see Signature)
query_ids, indices = retriever.search(queries, output_dir="./bm25_index", k1=0.9, b=0.4, hits=100)

# In-memory Python BM25
naive_retriever = NaiveBM25Retriever(k1=0.9, b=0.4)
corpus_texts = ["document one text", "document two text"]  # extend with more documents
naive_retriever.index(corpus_texts, verbose=True)
query_texts = ["query one", "query two"]  # extend with more queries
scores, indices = naive_retriever.search(query_texts, hits=10, k1=0.9, b=0.4, verbose=True)
# [0][0] indexing works for both nested lists and 2-D arrays
print(f"Top result for query 0: doc {indices[0][0]} with score {scores[0][0]:.3f}")