Implementation:Neuml Txtai Benchmarks
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Information Retrieval, Evaluation |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Command-line script for running BEIR benchmark evaluations across multiple retrieval methods, covering both txtai's own indexes and external BM25 baselines.
Description
The Benchmarks module is an evaluation runner that tests retrieval and search methods against the BEIR (Benchmarking IR) dataset collection. It defines a base Index class and ten specialized index implementations:
- Embed: Dense vector embeddings using txtai Embeddings with FAISS backend
- Hybrid: Combined embeddings + BM25 scoring using txtai
- RetrievalAugmentedGeneration: RAG pipeline combining embeddings retrieval with LLM re-ranking
- Score: BM25 scoring using txtai's ScoringFactory
- Similar: Similarity pipeline using cross-encoder or bi-encoder models
- Rerank: Two-stage retrieval with embeddings followed by similarity re-ranking
- RankBM25: BM25 using the rank-bm25 library
- BM25S: BM25 using the bm25s library with Lucene-style scoring
- SQLiteFTS: BM25 via SQLite's FTS5 full-text search extension
- Elastic: BM25 using Elasticsearch
Each index loads a BEIR corpus (corpus.jsonl), builds an index, runs queries (queries.jsonl), and evaluates against relevance judgments using pytrec_eval. Metrics include NDCG@k, MAP@k, Recall@k, and Precision@k. Results are output as JSON lines with timing, memory, and disk usage statistics.
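To make the reported metrics concrete, the following is a minimal pure-Python sketch of NDCG@k, Recall@k, and Precision@k for a single query. It is not the pytrec_eval implementation the script relies on: the function names and the toy `qrels` and `ranking` data are illustrative assumptions, MAP@k is omitted for brevity, and the sketch uses the common linear-gain, log2-discount formulation.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: linear gain, log2(rank + 1) discount."""
    return sum(gain / math.log2(rank + 1) for rank, gain in enumerate(gains, start=1))


def ndcg_at_k(qrels, ranking, k):
    """NDCG@k for one query. qrels maps doc id -> graded relevance."""
    gains = [qrels.get(doc, 0) for doc in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


def recall_at_k(qrels, ranking, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {doc for doc, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranking[:k])) / len(relevant)


def precision_at_k(qrels, ranking, k):
    """Fraction of the top k positions filled by relevant documents."""
    hits = sum(1 for doc in ranking[:k] if qrels.get(doc, 0) > 0)
    return hits / k


# Toy data (hypothetical): relevance judgments and one ranked result list
qrels = {"d1": 2, "d2": 1, "d3": 0}
ranking = ["d2", "d4", "d1"]

scores = {
    "ndcg": ndcg_at_k(qrels, ranking, 3),
    "recall": recall_at_k(qrels, ranking, 3),
    "precision": precision_at_k(qrels, ranking, 3),
}
print(scores)
```

Per-query values like these are typically averaged over all queries in a dataset to give the single figure reported per metric.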
Usage
Use the Benchmarks script to evaluate and compare different retrieval methods on standardized BEIR datasets. It is invoked from the command line with options for selecting specific methods, data sources, configuration files, and output directories. It supports incremental runs and caching of built indexes for faster re-evaluation.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: examples/benchmarks.py
Signature
class Index:
    def __init__(self, path, config, output, refresh)
    def __call__(self, limit, filterscores=True)
    def search(self, queries, limit)
    def index(self)
    def rows(self)
    def load(self)
    def batch(self, data, size)
    def readconfig(self, key, default)
class Embed(Index): ...
class Hybrid(Index): ...
class RetrievalAugmentedGeneration(Embed): ...
class Score(Index): ...
class Similar(Index): ...
class Rerank(Embed): ...
class RankBM25(Index): ...
class BM25S(Index): ...
class SQLiteFTS(Index): ...
class Elastic(Index): ...
def relevance(path)
def create(method, path, config, output, refresh)
def compute(results)
def evaluate(methods, path, args)
def benchmarks(args)
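The class layout above suggests a template pattern: a base class drives query execution while each subclass supplies its own `index` and `search`. The sketch below illustrates that pattern only. `ToyIndex`, `KeywordOverlap`, the in-memory corpus, and the token-overlap scoring are all hypothetical and are not taken from benchmarks.py.

```python
class ToyIndex:
    """Hypothetical stand-in for a base index class (not the real Index)."""

    def __init__(self, corpus, queries):
        self.corpus = corpus      # {doc id: text}
        self.queries = queries    # {query id: text}
        self.backend = self.index()

    def __call__(self, limit):
        """Run all queries in batches; return {query id: {doc id: score}}."""
        results = {}
        ids = list(self.queries)
        for chunk in self.batch(ids, 2):
            texts = [self.queries[uid] for uid in chunk]
            for uid, hits in zip(chunk, self.search(texts, limit)):
                results[uid] = dict(hits)
        return results

    def batch(self, data, size):
        """Yield successive slices of at most `size` items."""
        for start in range(0, len(data), size):
            yield data[start:start + size]

    def index(self):
        raise NotImplementedError

    def search(self, queries, limit):
        raise NotImplementedError


class KeywordOverlap(ToyIndex):
    """Scores documents by the number of distinct tokens shared with the query."""

    def index(self):
        # Tokenize each document once and keep the token sets as the "index"
        return {uid: set(text.lower().split()) for uid, text in self.corpus.items()}

    def search(self, queries, limit):
        output = []
        for query in queries:
            tokens = set(query.lower().split())
            scored = [(uid, len(tokens & words)) for uid, words in self.backend.items()]
            scored = [hit for hit in scored if hit[1] > 0]
            output.append(sorted(scored, key=lambda hit: hit[1], reverse=True)[:limit])
        return output


corpus = {"d1": "dense vector search", "d2": "sparse keyword search", "d3": "cooking pasta"}
queries = {"q1": "keyword search", "q2": "vector search", "q3": "pasta"}
results = KeywordOverlap(corpus, queries)(limit=2)
print(results)
```

Keeping the evaluation loop in the base class means a new retrieval backend only has to implement two methods to be compared against the others.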
Import
# Typically run as a standalone script
python examples/benchmarks.py [options]
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -d / --directory | str | No | Root directory path containing BEIR datasets; defaults to "beir" |
| -m / --methods | str | No | Comma-separated list of methods to evaluate (embed, hybrid, rag, scoring, rank, bm25s, sqlite, es, similar, rerank) |
| -s / --sources | str | No | Comma-separated list of BEIR dataset names to evaluate against |
| -c / --config | str | No | Path to YAML configuration file for custom index settings |
| -o / --output | str | No | Index output directory path |
| -r / --refresh | flag | No | If set, rebuilds indexes even if they already exist |
| -t / --topk | int | No | Top-k results for evaluation metrics; defaults to 10 |
| -n / --name | str | No | Name to assign to the benchmark run; defaults to method name |
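The options in this table map naturally onto an argparse parser. The following is a sketch built only from the table above, not a copy of the script's own parser: `build_parser` is a hypothetical helper, the help strings are paraphrased, and placing the defaults in the parser is an assumption (the script may apply them elsewhere).

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="BEIR benchmarks (sketch)")
    parser.add_argument("-d", "--directory", default="beir", help="root directory containing BEIR datasets")
    parser.add_argument("-m", "--methods", help="comma-separated list of methods")
    parser.add_argument("-s", "--sources", help="comma-separated list of dataset names")
    parser.add_argument("-c", "--config", help="path to YAML configuration file")
    parser.add_argument("-o", "--output", help="index output directory")
    parser.add_argument("-r", "--refresh", action="store_true", help="rebuild existing indexes")
    parser.add_argument("-t", "--topk", type=int, default=10, help="top-k cutoff for metrics")
    parser.add_argument("-n", "--name", help="name for the benchmark run")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_parser().parse_args(["-m", "embed,hybrid", "-s", "nfcorpus,scifact", "-t", "20"])

# Comma-separated options are split into lists before use
methods = args.methods.split(",")
sources = args.sources.split(",")
print(methods, sources, args.topk, args.directory, args.refresh)
```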
Outputs
| Name | Type | Description |
|---|---|---|
| benchmarks.json | JSON Lines file | One JSON object per method-source combination containing: source, method, name, index time, memory usage, disk usage, search time, NDCG@k, MAP@k, Recall@k, P@k |
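Because the output is JSON Lines, it can be post-processed one object per line. The sketch below shows one possible way to pick the best method per dataset. It assumes the `source` and `method` keys match the field names listed above; `load_results`, `best_by_source`, the `example_metric` key, and the demo rows are hypothetical placeholders, not measured results or the script's actual metric keys.

```python
import json
import os
import tempfile


def load_results(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def best_by_source(rows, metric):
    """Return {source: (method, value)} for the highest value of `metric`."""
    best = {}
    for row in rows:
        if metric not in row:
            continue
        current = best.get(row["source"])
        if current is None or row[metric] > current[1]:
            best[row["source"]] = (row["method"], row[metric])
    return best


# Placeholder rows written to a temporary file so the sketch is self-contained
rows = [
    {"source": "demo-a", "method": "embed", "example_metric": 0.1},
    {"source": "demo-a", "method": "hybrid", "example_metric": 0.2},
    {"source": "demo-b", "method": "embed", "example_metric": 0.3},
]
path = os.path.join(tempfile.mkdtemp(), "benchmarks.json")
with open(path, "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

loaded = load_results(path)
best = best_by_source(loaded, "example_metric")
print(best)
```

To use this on a real run, inspect one line of the output file first and substitute the metric key it actually contains.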
Usage Examples
# Run all benchmarks on all default BEIR datasets
# python examples/benchmarks.py
# Run specific methods on specific datasets
# python examples/benchmarks.py -m "embed,hybrid" -s "nfcorpus,scifact"
# Use custom configuration and output directory
# python examples/benchmarks.py -c config.yml -o /tmp/indexes -t 20
# Refresh (rebuild) indexes
# python examples/benchmarks.py -r -m "embed" -s "nfcorpus"
# Programmatic usage (assumes the repository root is on sys.path so that examples/ is importable)
from examples.benchmarks import create
# Create a single index
index = create("embed", "beir/nfcorpus", "config.yml", "output/embed", refresh=True)
# Run the dataset queries, keeping the top 10 results per query
results = index(limit=10)