Implementation:Neuml Txtai Benchmarks

From Leeroopedia


Knowledge Sources
Domains: Benchmarking, Information Retrieval, Evaluation
Last Updated: 2026-02-10 01:00 GMT

Overview

Concrete tool for running BEIR benchmark evaluations across multiple retrieval methods: txtai's own indexes plus external BM25 baselines (rank-bm25, bm25s, SQLite FTS5, and Elasticsearch).

Description

The Benchmarks module is a comprehensive evaluation runner that tests various retrieval and search methods against the BEIR (Benchmarking IR) dataset collection. It defines a base Index class and multiple specialized index implementations:

  • Embed: Dense vector embeddings using txtai Embeddings with FAISS backend
  • Hybrid: Combined embeddings + BM25 scoring using txtai
  • RetrievalAugmentedGeneration: RAG pipeline combining embeddings retrieval with LLM re-ranking
  • Score: BM25 scoring using txtai's ScoringFactory
  • Similar: Similarity pipeline using cross-encoder or bi-encoder models
  • Rerank: Two-stage retrieval with embeddings followed by similarity re-ranking
  • RankBM25: BM25 using the rank-bm25 library
  • BM25S: BM25 using the bm25s library with Lucene-style scoring
  • SQLiteFTS: BM25 via SQLite's FTS5 full-text search extension
  • Elastic: BM25 using Elasticsearch

Each index loads a BEIR corpus (corpus.jsonl), builds an index, runs queries (queries.jsonl), and evaluates against relevance judgments using pytrec_eval. Metrics include NDCG@k, MAP@k, Recall@k, and Precision@k. Results are output as JSON lines with timing, memory, and disk usage statistics.
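The four metrics named above can be illustrated with a small pure-Python sketch. This is not the script's implementation (which, per the description, delegates to pytrec_eval); the function names are hypothetical and the NDCG variant shown assumes linear gain with a log2 rank discount.

```python
import math

def precision_at_k(ranked, qrels, k):
    """Fraction of the top-k results that are relevant (judgment > 0)."""
    return sum(1 for doc in ranked[:k] if qrels.get(doc, 0) > 0) / k

def recall_at_k(ranked, qrels, k):
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {doc for doc, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)

def average_precision_at_k(ranked, qrels, k):
    """Precision averaged over the relevant hits in the top k,
    normalized by the total number of relevant documents."""
    relevant = {doc for doc, rel in qrels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def ndcg_at_k(ranked, qrels, k):
    """NDCG@k with linear gain: rel / log2(rank + 1), ranks starting at 1."""
    dcg = sum(qrels.get(doc, 0) / math.log2(i + 2) for i, doc in enumerate(ranked[:k]))
    ideal = sorted((rel for rel in qrels.values() if rel > 0), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Toy judgments for one query: d1 highly relevant, d2 relevant, d3 not relevant
qrels = {"d1": 2, "d2": 1, "d3": 0}
# Toy ranking returned by some retrieval method (d4 is unjudged)
ranked = ["d2", "d4", "d1"]

scores = {
    "P@3": precision_at_k(ranked, qrels, 3),
    "Recall@3": recall_at_k(ranked, qrels, 3),
    "MAP@3": average_precision_at_k(ranked, qrels, 3),
    "NDCG@3": ndcg_at_k(ranked, qrels, 3),
}
print(scores)
```

A benchmark run would average these per-query values over every query in the dataset.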

Usage

Use the Benchmarks script to evaluate and compare different retrieval methods on standardized BEIR datasets. It is invoked from the command line with options for selecting specific methods, data sources, configuration files, and output directories. It supports incremental runs and caching of built indexes for faster re-evaluation.
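The caching behavior described above can be pictured with a short sketch. It assumes, without confirmation from the source, that a cached index is detected by the existence of its output directory and that the refresh option forces a rebuild; `build_or_reuse` and its `build` callback are hypothetical names, not part of the script.

```python
import os

def build_or_reuse(output, refresh, build):
    """Reuse an index directory if it exists, otherwise build it.

    output:  directory where the index is stored
    refresh: when True, rebuild even if the directory already exists
    build:   callable that writes the index into `output` (hypothetical)
    Returns "cached" or "built" so callers can report what happened.
    """
    if os.path.exists(output) and not refresh:
        return "cached"

    os.makedirs(output, exist_ok=True)
    build(output)
    return "built"
```

With this shape, a second run against the same output directory skips indexing and goes straight to search, which is what makes repeated evaluation faster.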

Code Reference

Source Location

Signature

class Index:
    def __init__(self, path, config, output, refresh)
    def __call__(self, limit, filterscores=True)
    def search(self, queries, limit)
    def index(self)
    def rows(self)
    def load(self)
    def batch(self, data, size)
    def readconfig(self, key, default)

class Embed(Index): ...
class Hybrid(Index): ...
class RetrievalAugmentedGeneration(Embed): ...
class Score(Index): ...
class Similar(Index): ...
class Rerank(Embed): ...
class RankBM25(Index): ...
class BM25S(Index): ...
class SQLiteFTS(Index): ...
class Elastic(Index): ...

def relevance(path)
def create(method, path, config, output, refresh)
def compute(results)
def evaluate(methods, path, args)
def benchmarks(args)
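To make the data-loading side of these signatures concrete, here is a minimal sketch of reading BEIR-style files. It assumes the standard BEIR layout (JSON Lines records keyed by `_id`, and a tab-separated qrels file with a header row of query id, document id, and score). The helper names `read_jsonl` and `read_qrels` are hypothetical and are not the script's own `load` or `relevance` functions.

```python
import csv
import json

def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file
    such as corpus.jsonl or queries.jsonl."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def read_qrels(path):
    """Read relevance judgments into {query_id: {doc_id: score}}.

    Assumes a tab-separated file whose first row is a header.
    """
    rel = {}
    with open(path, encoding="utf-8", newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader, None)  # skip the header row
        for row in reader:
            if len(row) < 3:
                continue  # ignore blank or malformed rows
            queryid, docid, score = row[0], row[1], int(row[2])
            rel.setdefault(queryid, {})[docid] = score
    return rel
```

The nested dictionary returned by `read_qrels` is the shape that TREC-style evaluators generally expect for relevance judgments.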

Import

# Typically run as a standalone script
python examples/benchmarks.py [options]

I/O Contract

Inputs

Name | Type | Required | Description
-d / --directory | str | No | Root directory path containing BEIR datasets; defaults to "beir"
-m / --methods | str | No | Comma-separated list of methods to evaluate (embed, hybrid, rag, scoring, rank, bm25s, sqlite, es, similar, rerank)
-s / --sources | str | No | Comma-separated list of BEIR dataset names to evaluate against
-c / --config | str | No | Path to YAML configuration file for custom index settings
-o / --output | str | No | Index output directory path
-r / --refresh | flag | No | If set, rebuilds indexes even if they already exist
-t / --topk | int | No | Top-k results for evaluation metrics; defaults to 10
-n / --name | str | No | Name to assign to the benchmark run; defaults to method name
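The options in this table map naturally onto an argparse parser. The sketch below mirrors the documented flags and defaults only; the help strings, the `build_parser` name, and the comma-splitting step are assumptions rather than the script's actual code.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(description="BEIR benchmarks (sketch)")
    parser.add_argument("-d", "--directory", default="beir", help="root directory with BEIR datasets")
    parser.add_argument("-m", "--methods", help="comma-separated methods to evaluate")
    parser.add_argument("-s", "--sources", help="comma-separated BEIR dataset names")
    parser.add_argument("-c", "--config", help="path to a YAML configuration file")
    parser.add_argument("-o", "--output", help="index output directory")
    parser.add_argument("-r", "--refresh", action="store_true", help="rebuild existing indexes")
    parser.add_argument("-t", "--topk", type=int, default=10, help="top-k cutoff for metrics")
    parser.add_argument("-n", "--name", help="name for the benchmark run")
    return parser

# Parse an example command line instead of sys.argv
args = build_parser().parse_args(["-m", "embed,hybrid", "-s", "nfcorpus", "-t", "20"])

# Comma-separated options become lists; unset options stay None
methods = args.methods.split(",") if args.methods else None
sources = args.sources.split(",") if args.sources else None
print(args.directory, methods, sources, args.topk, args.refresh)
```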

Outputs

Name | Type | Description
benchmarks.json | JSON Lines file | One JSON object per method-source combination containing: source, method, name, index time, memory usage, disk usage, search time, NDCG@k, MAP@k, Recall@k, P@k
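Because the output is JSON Lines, results can be post-processed one record at a time. The sketch below picks the best method per dataset for a chosen metric. The key names (`source`, `method`, `ndcg_cut_10`) are assumptions about the record layout, and the sample values are placeholders, not real benchmark results.

```python
import io
import json

# Placeholder records standing in for benchmarks.json; keys and values are illustrative
SAMPLE = io.StringIO(
    '{"source": "nfcorpus", "method": "embed", "ndcg_cut_10": 0.5}\n'
    '{"source": "nfcorpus", "method": "hybrid", "ndcg_cut_10": 0.75}\n'
)

def read_results(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]

def best_by_source(rows, metric):
    """Return {source: row} keeping the row with the highest metric value."""
    best = {}
    for row in rows:
        current = best.get(row["source"])
        if current is None or row[metric] > current[metric]:
            best[row["source"]] = row
    return best

rows = read_results(SAMPLE)
best = best_by_source(rows, "ndcg_cut_10")
print(best["nfcorpus"]["method"])
```

To read a real results file, replace `SAMPLE` with `open("benchmarks.json", encoding="utf-8")` and adjust the metric key to match the records actually written.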

Usage Examples

# Run all benchmarks on all default BEIR datasets
# python examples/benchmarks.py

# Run specific methods on specific datasets
# python examples/benchmarks.py -m "embed,hybrid" -s "nfcorpus,scifact"

# Use custom configuration and output directory
# python examples/benchmarks.py -c config.yml -o /tmp/indexes -t 20

# Refresh (rebuild) indexes
# python examples/benchmarks.py -r -m "embed" -s "nfcorpus"

# Programmatic usage (assumes the repository root is on sys.path)
from examples.benchmarks import create

# Create a single index for one BEIR dataset
index = create("embed", "beir/nfcorpus", "config.yml", "output/embed", refresh=True)

# Run the dataset's queries, keeping the top 10 results per query
results = index(limit=10)
