Implementation:Neuml Txtai Benchmarks Example
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Information_Retrieval |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
A concrete tool for running benchmark evaluations that compare txtai's search and retrieval methods against external baselines.
Description
The benchmarks.py example implements a pluggable benchmarking framework that evaluates multiple retrieval strategies (dense embeddings, hybrid search, BM25, sparse scoring, reranking, RAG) using standard IR evaluation metrics via pytrec_eval. It defines a base Index class with subclasses for each method, loads BEIR-format datasets, runs queries, and computes NDCG/MAP/Recall scores. External baselines include Elasticsearch, rank_bm25, bm25s, and SQLite FTS.
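The base-class-plus-subclasses layout described above can be sketched in a few lines. This is a minimal, self-contained illustration of the pattern and not the code in benchmarks.py: `BaseIndex`, `KeywordOverlap`, `METHODS` and `create_index` are hypothetical names, and the toy token-overlap scorer stands in for a real search backend.

```python
class BaseIndex:
    """Hypothetical stand-in for the benchmark base class."""

    def __init__(self, documents):
        # documents: dict of doc_id -> text
        self.documents = documents
        self.backend = self.index()

    def __call__(self, queries, limit):
        # queries: dict of query_id -> text
        # Returns {query_id: {doc_id: score}}
        results = {}
        for qid, query in queries.items():
            results[qid] = dict(self.search(query, limit))
        return results

    def index(self):
        # Each method builds its own backend
        raise NotImplementedError

    def search(self, query, limit):
        # Each method returns a ranked list of (doc_id, score) tuples
        raise NotImplementedError


class KeywordOverlap(BaseIndex):
    """Toy method that scores documents by the number of shared tokens."""

    def index(self):
        return {uid: set(text.lower().split()) for uid, text in self.documents.items()}

    def search(self, query, limit):
        tokens = set(query.lower().split())
        scores = [(uid, float(len(tokens & words))) for uid, words in self.backend.items()]
        scores = [(uid, score) for uid, score in scores if score > 0]
        return sorted(scores, key=lambda x: x[1], reverse=True)[:limit]


# Registry that maps a method name to its class
METHODS = {"overlap": KeywordOverlap}


def create_index(method, documents):
    return METHODS[method](documents)


documents = {
    "d1": "dense vector search",
    "d2": "keyword search with bm25",
    "d3": "cooking recipes",
}

runner = create_index("overlap", documents)
results = runner({"q1": "bm25 keyword search"}, limit=2)
```

Because the evaluation loop lives in the base class, adding another method only requires a new subclass and a registry entry.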
Usage
Use this example when evaluating txtai retrieval quality against baselines on standard IR benchmark datasets (e.g., BEIR collections). It serves as a reference for how to set up comparative benchmarks and measure search effectiveness.
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: examples/benchmarks.py
- Lines: 1-731
Signature
class Index:
    def __init__(self, path, config, output, refresh):
        """
        Creates an Index benchmark runner.

        Args:
            path: path to BEIR dataset
            config: YAML configuration dict
            output: output directory for results
            refresh: if True, rebuild index from scratch
        """

    def __call__(self, limit, filterscores=True):
        """Runs search evaluation and returns results dict."""

    def search(self, queries, limit):
        """Executes search queries against the index."""

    def index(self):
        """Builds the embeddings index."""

class Embed(Index):
    """Dense embeddings search benchmark."""

class Hybrid(Index):
    """Hybrid dense + sparse search benchmark."""

class RetrievalAugmentedGeneration(Index):
    """RAG benchmark with LLM reranking."""

class Score(Index):
    """Keyword scoring (BM25/TF-IDF) benchmark."""

class Similar(Index):
    """Similarity pipeline benchmark."""

class Rerank(Index):
    """Two-stage retrieval with reranking benchmark."""

class RankBM25(Index):
    """rank_bm25 library baseline benchmark."""

class BM25S(Index):
    """bm25s library baseline benchmark."""

class SQLiteFTS(Index):
    """SQLite full-text search baseline benchmark."""

class Elastic(Index):
    """Elasticsearch baseline benchmark."""
Import
# Run directly as a script
python examples/benchmarks.py -p /path/to/beir/dataset -c config.yml -o output/
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to BEIR-format dataset directory containing corpus.jsonl and queries.jsonl |
| config | str | Yes | Path to YAML configuration file specifying methods and embeddings settings |
| output | str | No | Output directory for benchmark results (CSV files) |
| refresh | bool | No | If True, rebuild indexes from scratch instead of loading existing |
| limit | int | No | Number of results to retrieve per query (default from config) |
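To make the dataset input concrete, the following sketch reads a BEIR-format directory using only the standard library. It is an illustration under stated assumptions and not the loader in benchmarks.py: `load_jsonl` and `load_qrels` are hypothetical helper names, and the `qrels/test.tsv` location follows the usual BEIR layout rather than anything stated in the table above. The tiny dataset is written to a temporary directory so the example runs on its own.

```python
import csv
import json
import os
import tempfile


def load_jsonl(path):
    """Reads a JSON Lines file into a dict of _id -> record."""
    rows = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                row = json.loads(line)
                rows[row["_id"]] = row
    return rows


def load_qrels(path):
    """Reads a qrels TSV (query-id, corpus-id, score) into {query_id: {doc_id: relevance}}."""
    qrels = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the header row
        for row in reader:
            if len(row) == 3:
                qid, docid, score = row
                qrels.setdefault(qid, {})[docid] = int(score)
    return qrels


# Write a one-document dataset to a temporary directory, then load it back
with tempfile.TemporaryDirectory() as directory:
    with open(os.path.join(directory, "corpus.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "d1", "title": "BM25", "text": "A ranking function"}) + "\n")

    with open(os.path.join(directory, "queries.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "q1", "text": "what is bm25"}) + "\n")

    os.makedirs(os.path.join(directory, "qrels"))
    with open(os.path.join(directory, "qrels", "test.tsv"), "w", encoding="utf-8") as f:
        f.write("query-id\tcorpus-id\tscore\nq1\td1\t1\n")

    corpus = load_jsonl(os.path.join(directory, "corpus.jsonl"))
    queries = load_jsonl(os.path.join(directory, "queries.jsonl"))
    qrels = load_qrels(os.path.join(directory, "qrels", "test.tsv"))
```

The nested `qrels` dict has the same `{query_id: {doc_id: value}}` shape as the search results, which is what makes the two easy to compare during evaluation.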
Outputs
| Name | Type | Description |
|---|---|---|
| results | dict | Dictionary mapping method names to {query_id: {doc_id: score}} dicts |
| metrics | dict | NDCG@10, MAP, Recall scores computed via pytrec_eval |
| CSV files | Files | Per-method result files written to output directory |
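The relationship between the results dict and the reported metrics can be illustrated with a small pure-Python calculation. This is a sketch for intuition only: the script relies on pytrec_eval for its scores, and the hypothetical helpers `dcg`, `ndcg_at_k` and `recall_at_k` below assume linear gain with a log2 rank discount, which may differ in detail from the library's implementation.

```python
import math


def dcg(gains):
    """Discounted cumulative gain for a ranked list of relevance values."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains))


def ndcg_at_k(ranked, relevant, k):
    """
    ranked: {doc_id: score} returned for one query
    relevant: {doc_id: relevance} judgments for the same query
    """
    ordered = sorted(ranked, key=ranked.get, reverse=True)[:k]
    gains = [relevant.get(uid, 0) for uid in ordered]
    ideal = dcg(sorted(relevant.values(), reverse=True)[:k])
    return dcg(gains) / ideal if ideal > 0 else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    ordered = sorted(ranked, key=ranked.get, reverse=True)[:k]
    positives = {uid for uid, rel in relevant.items() if rel > 0}
    if not positives:
        return 0.0
    return len(positives & set(ordered)) / len(positives)


# One query with three retrieved documents and two relevant documents
results = {"q1": {"d1": 0.9, "d2": 0.5, "d3": 0.1}}
qrels = {"q1": {"d2": 1, "d4": 1}}

ndcg = ndcg_at_k(results["q1"], qrels["q1"], 10)
recall = recall_at_k(results["q1"], qrels["q1"], 10)
```

Here only one of the two relevant documents is retrieved, and it sits at rank 2, so both metrics fall below their maximum of 1.0.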
Usage Examples
Running Benchmarks
# Command-line usage
# python examples/benchmarks.py -p /data/beir/nfcorpus -c config.yml -o results/
# Example config.yml:
# path: /data/beir/nfcorpus
# embed:
#   path: sentence-transformers/nli-mpnet-base-v2
#   content: true
# methods:
#   - embed
#   - hybrid
#   - score
# limit: 10
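# Programmatic usage (run from the repository root so that examples/ is importable)
import yaml
from examples.benchmarks import create

# Assumption: config is the parsed YAML dict, as the Index docstring describes
with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Create an index instance for the "embed" method
index = create("embed", "/data/beir/nfcorpus", config, "results/", refresh=True)
# Run the evaluation, retrieving up to 10 results per query
results = index(limit=10)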
Adding a Custom Method
# Subclass Index to add a custom retrieval method
from txtai.embeddings import Embeddings
from examples.benchmarks import Index

class CustomSearch(Index):
    def index(self):
        """Build custom index."""
        self.embeddings = Embeddings(self.config)
        self.embeddings.index(self.rows())

    def search(self, queries, limit):
        """Run custom search logic."""
        # Collect the results for each query, keyed by query id
        results = {}
        for qid, query in queries:
            results[qid] = self.embeddings.search(query, limit)
        return results