

Implementation:Neuml Txtai Terms Index



Knowledge Sources
Domains: Information Retrieval, Sparse Indexing, Text Search
Last Updated: 2026-02-10 01:00 GMT

Overview

Concrete tool for building, searching and storing memory-efficient term frequency sparse arrays backed by SQLite, provided by txtai.

Description

The Terms class manages a SQLite-backed sparse term frequency index. Term-to-document mappings are stored as compact binary blobs (arrays of 8-byte signed integers, i.e. C long long values) in a SQLite database with two tables: a terms table that maps each term to its document IDs and frequencies, and a documents table that stores document metadata (index ID, external ID, deleted flag, and token length).
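The storage layout described above can be illustrated with the standard library alone. This is a minimal sketch, not txtai's actual code: the column names and the `put`/`get` helpers are assumptions chosen to mirror the description.

```python
import sqlite3
from array import array

# Hypothetical schema mirroring the two tables described above
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE terms (term TEXT PRIMARY KEY, ids BLOB, freqs BLOB)")
connection.execute(
    "CREATE TABLE documents (indexid INTEGER PRIMARY KEY, id TEXT, deleted INTEGER, length INTEGER)"
)

def put(term, ids, freqs):
    """Stores parallel arrays of document index ids and frequencies as binary blobs."""
    row = (term, array("q", ids).tobytes(), array("q", freqs).tobytes())
    connection.execute("INSERT OR REPLACE INTO terms VALUES (?, ?, ?)", row)

def get(term):
    """Reads the blobs for a term back into 8-byte signed integer arrays."""
    row = connection.execute("SELECT ids, freqs FROM terms WHERE term = ?", (term,)).fetchone()
    if row is None:
        return None, None

    ids, freqs = array("q"), array("q")
    ids.frombytes(row[0])
    freqs.frombytes(row[1])
    return ids, freqs

# "hello" occurs twice in document 0 and once in document 2
put("hello", [0, 2], [2, 1])
ids, freqs = get("hello")
```

Each entry costs 16 bytes (one id plus one frequency), which is what keeps the index small compared with storing one row per term-document pair.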

The search algorithm follows a two-phase approach similar to Apache Lucene's common terms query: first, less common terms (appearing in fewer than 10% of documents by default) are scored; then, common term scores are merged only for documents already matching the initial query. This strategy avoids full scans over high-frequency terms.
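The two-phase strategy can be sketched in plain Python. The `index` dictionary below is a hypothetical in-memory stand-in for the terms table, holding precomputed weights per document. Scoring common terms on their own when no rare term matches is an assumption of this sketch, not something stated above.

```python
from collections import Counter

def search(index, total, query, limit, cutoff=0.1):
    """
    Two-phase scoring sketch.

    index: dict of term -> {document id: weight} (hypothetical stand-in)
    total: number of documents in the index
    """
    scores, skipped = {}, {}

    # Phase 1: score terms that appear in fewer than cutoff * total documents
    for term, freq in Counter(query).items():
        postings = index.get(term)
        if not postings:
            continue
        if len(postings) < cutoff * total:
            for doc, weight in postings.items():
                scores[doc] = scores.get(doc, 0.0) + freq * weight
        else:
            skipped[term] = freq

    # Phase 2: merge common terms, only for documents that already matched
    hasscores = bool(scores)
    for term, freq in skipped.items():
        for doc, weight in index[term].items():
            if not hasscores or doc in scores:
                scores[doc] = scores.get(doc, 0.0) + freq * weight

    return sorted(scores.items(), key=lambda x: -x[1])[:limit]

# 20 documents: "rare" occurs in one of them, "the" occurs in all of them
index = {"rare": {3: 2.0}, "the": {doc: 0.5 for doc in range(20)}}
print(search(index, 20, ["rare", "the"], 10))  # [(3, 2.5)]
```

Only document 3 is returned: the common term contributes to its score but does not pull the other 19 documents into the result set.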

Term weights are computed from a configurable scoring function and IDF weights; an LRU cache (maxsize=500) keeps frequently accessed term weight arrays in memory. A configurable cache limit (cachelimit, default 250000000 bytes, about 250 MB) controls when in-memory term data is flushed to the database. The class also supports wildcard term expansion with the asterisk (*) operator and thread-safe database access via an RLock.
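The caching, locking, and wildcard behaviors above can be illustrated together. This is a hedged sketch: the `CachedWeights` class, its `reads` counter, and the in-memory `postings` dictionary are hypothetical and stand in for the database-backed lookups.

```python
import functools
import re
import threading

class CachedWeights:
    """Hypothetical stand-in showing cached, lock-guarded weight lookups and wildcard expansion."""

    def __init__(self, postings, idf):
        # postings: term -> (ids, freqs), idf: term -> IDF weight
        self.postings, self.idf = postings, idf
        self.lock = threading.RLock()
        self.reads = 0

    @functools.lru_cache(maxsize=500)
    def weights(self, term):
        """Computes weights for a term; repeated calls are served from the LRU cache."""
        with self.lock:
            self.reads += 1
            ids, freqs = self.postings.get(term, ((), ()))
            return tuple(ids), tuple(self.idf[term] * freq for freq in freqs)

    def expand(self, terms):
        """Replaces terms containing * with every indexed term that matches the pattern."""
        expanded = []
        for term in terms:
            if "*" in term:
                # Escape everything except the asterisk, which matches any run of characters
                pattern = ".*".join(re.escape(part) for part in term.split("*"))
                expanded.extend(sorted(t for t in self.postings if re.fullmatch(pattern, t)))
            else:
                expanded.append(term)
        return expanded

postings = {"search": ([0, 1], [2, 1]), "searching": ([1], [1]), "index": ([0], [3])}
cache = CachedWeights(postings, {"search": 1.0, "searching": 2.0, "index": 0.5})
print(cache.expand(["search*", "index"]))  # ['search', 'searching', 'index']
```

One caveat of this pattern: `functools.lru_cache` on a method keys on `self` and keeps the instance referenced for as long as its entries stay in the cache.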

Usage

Use Terms when you need a persistent, memory-efficient sparse term frequency index for keyword-based text search. It is instantiated internally by the TFIDF scoring class when a terms configuration section is present in the scoring config. This is the core data structure behind txtai's keyword search with TF-IDF, BM25, and SIF scoring.
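As a rough illustration of what "a terms configuration section is present" could look like, the fragment below builds a scoring config and checks for the section. The `terms` keys come from this page; the `method` key, the example values, and the `terms_section` helper are assumptions made for illustration only.

```python
# Hypothetical scoring configuration. The "terms" section and its keys come from
# the description on this page; the "method" key and all values are assumptions.
scoring = {
    "method": "bm25",
    "terms": {"cachelimit": 250000000, "cutoff": 0.1, "wal": True},
}

def terms_section(config):
    """Returns the terms settings when a terms section is present, otherwise None."""
    section = config.get("terms")
    if not section:
        return None

    # Treat a bare truthy flag as "use defaults" (an assumption of this sketch)
    return section if isinstance(section, dict) else {}

print(terms_section(scoring)["cutoff"])  # 0.1
```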

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/scoring/terms.py

Signature

class Terms:
    def __init__(self, config, score, idf)
    def insert(self, uid, terms)
    def delete(self, ids)
    def index(self)
    def search(self, terms, limit) -> list
    def escape(self, query) -> str
    def count(self) -> int
    def load(self, path)
    def save(self, path)
    def close(self)
    def initialize(self)
    def connect(self, path="") -> connection
    def copy(self, path) -> connection
    def add(self, indexid, term, freq)
    def lookup(self, term) -> (uids, freqs)
    def expand(self, terms) -> list
    def weights(self, term) -> (uids, weights)
    def topn(self, scores, limit, hasscores, skipped) -> list
    def merge(self, scores, matches, hasscores, terms)
    def candidates(self, scores, topn) -> ndarray

Import

from txtai.scoring.terms import Terms

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| config | dict | Yes | Configuration dictionary; supports keys cachelimit (int, default 250000000), cutoff (float, default 0.1), and wal (bool) for WAL journal mode. |
| score | callable | Yes | Scoring function accepting (freq, idf, length) arrays and returning weight arrays (e.g., TF-IDF or BM25 score function). |
| idf | dict | Yes | Dictionary mapping terms to their IDF weight values. |
| uid | str/int | Yes (insert) | External document identifier. |
| terms | list[str] | Yes (insert/search) | List of tokenized terms for a document (insert) or query (search). |
| limit | int | Yes (search) | Maximum number of results to return. |
| path | str | Yes (load/save) | File system path for the SQLite terms database. |
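To make the score callable contract concrete, here is a minimal sketch of a BM25-style function with the (freq, idf, length) signature. The `bm25` factory, the `avgdl` (average document length) argument, and the k1 and b defaults are assumptions of this sketch rather than txtai's implementation. It is shown with scalars; the same arithmetic applies elementwise if arrays are passed.

```python
def bm25(avgdl, k1=1.2, b=0.75):
    """Builds a score function with the (freq, idf, length) signature described above."""

    def score(freq, idf, length):
        # Length normalization: documents longer than average are penalized
        k = k1 * ((1 - b) + b * length / avgdl)
        return idf * (freq * (k1 + 1)) / (freq + k)

    return score

score = bm25(avgdl=2.5)

# A term occurring twice in a three-token document, with an IDF weight of 1.5
weight = score(2, 1.5, 3)
```

A useful sanity check: for a document of exactly average length and a term frequency of 1, the weight reduces to the IDF value itself.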

Outputs

| Name | Type | Description |
|------|------|-------------|
| search results | list[tuple(id, float)] | List of (document_id, score) tuples sorted by descending score. |
| count | int | Number of non-deleted documents in the index. |
| uids, freqs | tuple(array, array) | Raw term frequency sparse arrays from lookup. |
| uids, weights | tuple(ndarray, ndarray) | Computed term weight sparse arrays from the weights method. |

Usage Examples

from txtai.scoring.terms import Terms
import numpy as np

# Define a simple TF-IDF scoring function
def score_fn(freq, idf, length):
    return idf * np.sqrt(freq) * (1 / np.sqrt(length))

# IDF weights dictionary
idf = {"hello": 1.5, "world": 1.2, "foo": 2.0}

# Create a terms index
config = {"cachelimit": 100000000, "cutoff": 0.1}
terms = Terms(config, score_fn, idf)

# Insert documents
terms.insert("doc1", ["hello", "world", "hello"])
terms.insert("doc2", ["foo", "world"])

# Flush to database
terms.index()

# Search for documents matching query terms
results = terms.search(["hello", "world"], limit=10)
# Returns (document_id, score) tuples, highest score first. With score_fn above,
# roughly [("doc1", 1.92), ("doc2", 0.85)] if per-term weights are summed

# Save and load
terms.save("/tmp/my_terms.db")
terms.close()

new_terms = Terms(config, score_fn, idf)
new_terms.load("/tmp/my_terms.db")
results = new_terms.search(["foo"], limit=5)
