Implementation:Neuml Txtai Benchmarks Example

Knowledge Sources	Neuml_Txtai
Domains	Benchmarking, Information_Retrieval
Last Updated	2026-02-09 17:00 GMT

Overview

A concrete tool for running benchmark evaluations that compare txtai's search and retrieval methods against external baselines.

Description

The benchmarks.py example implements a pluggable benchmarking framework that evaluates multiple retrieval strategies (dense embeddings, hybrid search, BM25, sparse scoring, reranking, RAG) using standard IR evaluation metrics via pytrec_eval. It defines a base Index class with subclasses for each method, loads BEIR-format datasets, runs queries, and computes NDCG/MAP/Recall scores. External baselines include Elasticsearch, rank_bm25, bm25s, and SQLite FTS.
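
The base-class-plus-subclasses design described here can be illustrated with a small, dependency-free sketch. Everything in it is hypothetical and exists only for illustration: `ToyIndex`, `TermOverlap`, `METHODS`, and `create_method` are not names from benchmarks.py, and the toy token-overlap scorer stands in for a real retrieval method.

```python
from abc import ABC, abstractmethod


class ToyIndex(ABC):
    """Hypothetical stand-in for a base benchmark class: build once, then search."""

    def __init__(self, corpus):
        # corpus: dict of doc_id -> text
        self.corpus = corpus
        self.index()

    @abstractmethod
    def index(self):
        """Builds whatever structure the method needs."""

    @abstractmethod
    def search(self, query, limit):
        """Returns a list of (doc_id, score) tuples, best first."""

    def __call__(self, queries, limit):
        # queries: dict of query_id -> text
        # result: {query_id: {doc_id: score}}
        return {qid: dict(self.search(text, limit)) for qid, text in queries.items()}


class TermOverlap(ToyIndex):
    """Toy keyword method: the score is the number of shared lowercase tokens."""

    def index(self):
        self.tokens = {uid: set(text.lower().split()) for uid, text in self.corpus.items()}

    def search(self, query, limit):
        terms = set(query.lower().split())
        scores = [(uid, float(len(terms & tokens))) for uid, tokens in self.tokens.items()]
        # Drop documents with no overlap, then sort by score (ties broken by id)
        scores = [x for x in scores if x[1] > 0]
        return sorted(scores, key=lambda x: (-x[1], x[0]))[:limit]


# Registry mapping method names to classes (hypothetical factory)
METHODS = {"overlap": TermOverlap}


def create_method(name, corpus):
    return METHODS[name](corpus)


corpus = {
    "d1": "vitamin d and bone health",
    "d2": "statins lower cholesterol",
    "d3": "bone density in adults",
}
queries = {"q1": "bone health"}

results = create_method("overlap", corpus)(queries, limit=2)
print(results)
```

Adding another method in this sketch means writing one more subclass and registering it, which is the property that makes a benchmark harness of this kind easy to extend.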

Usage

Use this example when evaluating txtai retrieval quality against baselines on standard IR benchmark datasets (e.g., BEIR collections). It serves as a reference for how to set up comparative benchmarks and measure search effectiveness.

Code Reference

Source Location

Repository: Neuml_Txtai
File: examples/benchmarks.py
Lines: 1-731

Signature

class Index:
    def __init__(self, path, config, output, refresh):
        """
        Creates an Index benchmark runner.

        Args:
            path: path to BEIR dataset
            config: YAML configuration dict
            output: output directory for results
            refresh: if True, rebuild index from scratch
        """

    def __call__(self, limit, filterscores=True):
        """Runs search evaluation and returns results dict."""

    def search(self, queries, limit):
        """Executes search queries against the index."""

    def index(self):
        """Builds the embeddings index."""

class Embed(Index):
    """Dense embeddings search benchmark."""

class Hybrid(Index):
    """Hybrid dense + sparse search benchmark."""

class RetrievalAugmentedGeneration(Index):
    """RAG benchmark with LLM reranking."""

class Score(Index):
    """Keyword scoring (BM25/TF-IDF) benchmark."""

class Similar(Index):
    """Similarity pipeline benchmark."""

class Rerank(Index):
    """Two-stage retrieval with reranking benchmark."""

class RankBM25(Index):
    """rank_bm25 library baseline benchmark."""

class BM25S(Index):
    """bm25s library baseline benchmark."""

class SQLiteFTS(Index):
    """SQLite full-text search baseline benchmark."""

class Elastic(Index):
    """Elasticsearch baseline benchmark."""

Import

# Run directly as a script
python examples/benchmarks.py -p /path/to/beir/dataset -c config.yml -o output/

I/O Contract

Inputs

Name	Type	Required	Description
path	str	Yes	Path to BEIR-format dataset directory containing corpus.jsonl and queries.jsonl
config	str	Yes	Path to YAML configuration file specifying methods and embeddings settings
output	str	No	Output directory for benchmark results (CSV files)
refresh	bool	No	If True, rebuild indexes from scratch instead of loading existing
limit	int	No	Number of results to retrieve per query (default from config)
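
As a rough illustration of the dataset input, the sketch below reads corpus.jsonl and queries.jsonl from a directory. It assumes the usual BEIR field names (`_id`, `title`, `text`). The helper names `load_corpus` and `load_queries` are hypothetical, and joining title and text with a period is an assumption rather than confirmed behavior of benchmarks.py.

```python
import json
import os
import tempfile


def load_corpus(path):
    """Yields (doc_id, text) pairs from corpus.jsonl, prepending the title when present."""
    with open(os.path.join(path, "corpus.jsonl"), encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            text = f'{row["title"]}. {row["text"]}' if row.get("title") else row["text"]
            yield row["_id"], text


def load_queries(path):
    """Returns a dict of query_id -> query text from queries.jsonl."""
    with open(os.path.join(path, "queries.jsonl"), encoding="utf-8") as f:
        return {row["_id"]: row["text"] for row in map(json.loads, f)}


# Write a tiny dataset to a temporary directory to exercise the loaders
with tempfile.TemporaryDirectory() as path:
    with open(os.path.join(path, "corpus.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "d1", "title": "Bones", "text": "Vitamin D supports bone health"}) + "\n")
        f.write(json.dumps({"_id": "d2", "title": "", "text": "Statins lower cholesterol"}) + "\n")
    with open(os.path.join(path, "queries.jsonl"), "w", encoding="utf-8") as f:
        f.write(json.dumps({"_id": "q1", "text": "bone health"}) + "\n")

    corpus = dict(load_corpus(path))
    queries = load_queries(path)

print(corpus)
print(queries)
```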

Outputs

Name	Type	Description
results	dict	Dictionary mapping method names to {query_id: {doc_id: score}} dicts
metrics	dict	NDCG@10, MAP, Recall scores computed via pytrec_eval
CSV files	Files	Per-method result files written to output directory
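
To make the shape of these outputs concrete, the following sketch scores a `{query_id: {doc_id: score}}` results dict against relevance judgments using pure-Python Recall@k and NDCG@k. It is not the pytrec_eval implementation, and its values may differ from pytrec_eval in edge cases such as tied scores. The `qrels` dict, the function names, and the sample data are all made up for illustration.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k."""
    hits = sum(1 for uid in ranked[:k] if relevant.get(uid, 0) > 0)
    total = sum(1 for grade in relevant.values() if grade > 0)
    return hits / total if total else 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG with linear gain and a log2 rank discount."""
    dcg = sum(relevant.get(uid, 0) / math.log2(i + 2) for i, uid in enumerate(ranked[:k]))
    ideal = sorted(relevant.values(), reverse=True)[:k]
    idcg = sum(grade / math.log2(i + 2) for i, grade in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def evaluate_run(results, qrels, k=10):
    """Averages both metrics over every query that has relevance judgments."""
    recall, ndcg = [], []
    for qid, relevant in qrels.items():
        scores = results.get(qid, {})
        # Rank document ids by descending score
        ranked = sorted(scores, key=scores.get, reverse=True)
        recall.append(recall_at_k(ranked, relevant, k))
        ndcg.append(ndcg_at_k(ranked, relevant, k))
    return {"recall": sum(recall) / len(recall), "ndcg": sum(ndcg) / len(ndcg)}


# Hypothetical run for one query and its relevance judgments
results = {"q1": {"d1": 0.9, "d2": 0.4, "d3": 0.1}}
qrels = {"q1": {"d1": 1, "d3": 1}}

metrics = evaluate_run(results, qrels, k=2)
print(metrics)
```

With k=2 only one of the two relevant documents is retrieved, so recall is 0.5 and NDCG falls below 1 because the second relevant document is ranked outside the cutoff.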

Usage Examples

Running Benchmarks

# Command-line usage
# python examples/benchmarks.py -p /data/beir/nfcorpus -c config.yml -o results/

# Example config.yml:
# path: /data/beir/nfcorpus
# embed:
#   path: sentence-transformers/nli-mpnet-base-v2
#   content: true
# methods:
#   - embed
#   - hybrid
#   - score
# limit: 10

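# Programmatic usage
import yaml

from examples.benchmarks import create

# Load the YAML configuration into a dict (config.yml as shown above)
with open("config.yml", encoding="utf-8") as f:
    config = yaml.safe_load(f)

# Create an index instance for the "embed" method
index = create("embed", "/data/beir/nfcorpus", config, "results/", refresh=True)

# Run evaluation: returns {query_id: {doc_id: score}}
results = index(limit=10)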

Adding a Custom Method

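# Subclass Index to add a custom retrieval method
from txtai.embeddings import Embeddings

from examples.benchmarks import Index

class CustomSearch(Index):
    def index(self):
        """Build custom index."""
        # self.config is passed to Embeddings as its configuration
        self.embeddings = Embeddings(self.config)
        # self.rows() supplies the corpus documents to index
        self.embeddings.index(self.rows())

    def search(self, queries, limit):
        """Run custom search logic."""
        # Assumes queries is an iterable of (query_id, text) pairs
        results = {}
        for qid, query in queries:
            results[qid] = self.embeddings.search(query, limit)
        return results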
