Implementation:Neuml Txtai Terms Index

Knowledge Sources	Neuml_Txtai
Domains	Information Retrieval, Sparse Indexing, Text Search
Last Updated	2026-02-10 01:00 GMT

Overview

Concrete tool for building, searching and storing memory-efficient term frequency sparse arrays backed by SQLite, provided by txtai.

Description

The Terms class manages a SQLite-backed sparse term frequency index. It stores term-to-document mappings as compressed binary blobs (arrays of 8-byte signed long longs) in a SQLite database with two tables: a terms table mapping each term to its document IDs and frequencies, and a documents table storing document metadata (index ID, external ID, deleted flag, and token length).
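The storage layout described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual schema: the column names and the in-memory database are assumptions, and the sketch stores raw array bytes without modelling any compression.

```python
import sqlite3
from array import array

# Hypothetical schema mirroring the two tables described above (column names are assumptions)
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE terms (term TEXT PRIMARY KEY, ids BLOB, freqs BLOB)")
cursor.execute("CREATE TABLE documents (indexid INTEGER PRIMARY KEY, id TEXT, deleted INTEGER, length INTEGER)")

# Document metadata: index ID, external ID, deleted flag, token length
cursor.execute("INSERT INTO documents VALUES (?, ?, ?, ?)", (0, "doc1", 0, 3))

# The term "hello" appears twice in document 0 and once in document 2.
# array("q") holds 8-byte signed long longs.
ids, freqs = array("q", [0, 2]), array("q", [2, 1])
cursor.execute("INSERT INTO terms VALUES (?, ?, ?)", ("hello", ids.tobytes(), freqs.tobytes()))

# Read the blobs back into typed arrays
row = cursor.execute("SELECT ids, freqs FROM terms WHERE term = ?", ("hello",)).fetchone()
loaded_ids, loaded_freqs = array("q"), array("q")
loaded_ids.frombytes(row[0])
loaded_freqs.frombytes(row[1])
```

Each (document ID, frequency) pair costs 16 bytes in this layout, which is what keeps the per-term arrays compact.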

The search algorithm follows a two-phase approach similar to Apache Lucene's common terms query: first, less common terms (appearing in fewer than 10% of documents by default) are scored; then, common term scores are merged only for documents already matching the initial query. This strategy avoids full scans over high-frequency terms.
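The two-phase idea can be sketched in plain Python as follows. The function name, the dictionary-based postings structure, and the fallback used when no less common term matches are all assumptions made for illustration; the real class works on sparse arrays rather than dictionaries.

```python
def two_phase_search(postings, query, total, limit, cutoff=0.1):
    """
    Sketch of a two-phase keyword search.

    postings maps term -> {docid: weight}; total is the number of indexed documents.
    """
    scores, common = {}, []

    # Phase 1: score only the less common terms
    for term in query:
        docs = postings.get(term)
        if not docs:
            continue
        if len(docs) < cutoff * total:
            for docid, weight in docs.items():
                scores[docid] = scores.get(docid, 0.0) + weight
        else:
            common.append(term)

    # Phase 2: common terms only adjust documents that already matched.
    # Assumption: if no less common term matched, common terms are scored directly.
    restrict = bool(scores)
    for term in common:
        for docid, weight in postings[term].items():
            if not restrict or docid in scores:
                scores[docid] = scores.get(docid, 0.0) + weight

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:limit]


# "rare" appears in 1 of 20 documents, "the" appears in 10 of 20
postings = {
    "rare": {3: 2.0},
    "the": {docid: 0.5 for docid in range(10)},
}
results = two_phase_search(postings, ["rare", "the"], total=20, limit=5)
```

In this toy index only document 3 is returned: the common term contributes to its score but does not pull in the other nine documents that contain it.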

Term weights are computed from a configurable scoring function and per-term IDF weights, and an LRU cache (maxsize=500) keeps frequently accessed term weight arrays in memory. A configurable cache limit (default 250000000 bytes, roughly 250 MB) controls when in-memory term data is flushed to the database. The class also supports wildcard term expansion with the asterisk operator and guards database access with an RLock for thread safety.
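A rough sketch of how cached weights, locking, and wildcard expansion might fit together is shown below. The WeightCache class, its dictionary postings, and the simplified idf * freq weighting are hypothetical stand-ins; only the maxsize=500 cache, the RLock, and the asterisk wildcard come from the description above.

```python
import functools
import re
from threading import RLock


class WeightCache:
    """Hypothetical stand-in for the caching and expansion behaviour described above."""

    def __init__(self, postings, idf):
        # postings maps term -> {docid: freq}; idf maps term -> IDF weight
        self.postings, self.idf = postings, idf
        self.lock = RLock()
        self.lookups = 0

    @functools.lru_cache(maxsize=500)
    def weights(self, term):
        # Only cache misses reach this body; the lock guards the shared lookup
        with self.lock:
            self.lookups += 1
            docs = self.postings.get(term, {})
            # Simplified weighting (idf * freq), an assumption for this sketch
            return tuple((docid, self.idf[term] * freq) for docid, freq in docs.items())

    def expand(self, terms):
        # Replace each wildcard term with the indexed terms it matches
        results = []
        for term in terms:
            if "*" in term:
                pattern = re.compile(re.escape(term).replace(r"\*", ".*"))
                results.extend(sorted(t for t in self.postings if pattern.fullmatch(t)))
            else:
                results.append(term)
        return results


cache = WeightCache(
    {"hello": {0: 2, 2: 1}, "help": {1: 1}, "world": {0: 1}},
    {"hello": 1.5, "help": 2.0, "world": 1.2},
)
first = cache.weights("hello")
second = cache.weights("hello")  # served from the LRU cache
expanded = cache.expand(["hel*", "world"])
```

One design consequence worth noting: cached weights go stale when the index changes, so a cache like this needs to be cleared after inserts, deletes, or loading a different database.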

Usage

Use Terms when you need a persistent, memory-efficient sparse term frequency index for keyword-based text search. The TFIDF scoring class instantiates it internally when the scoring config contains a terms section. It is the core data structure behind txtai's keyword search with TF-IDF, BM25, and SIF scoring.
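As an illustration, a scoring configuration with a terms section might look like the fragment below. The cachelimit and cutoff keys and their defaults come from this page; the "method" key and the terms_settings helper are assumptions added for the sketch and are not part of the Terms class.

```python
# Scoring configuration with a terms section (the "method" key is an assumption)
scoring_config = {
    "method": "bm25",
    "terms": {"cachelimit": 250000000, "cutoff": 0.1},
}


def terms_settings(config):
    """
    Hypothetical helper: returns the terms settings with this page's defaults
    filled in, or None when the config has no terms section.
    """
    terms = config.get("terms")
    if not terms:
        return None

    # Treat a non-dict value (such as True) as "enable with defaults"
    terms = terms if isinstance(terms, dict) else {}
    return {
        "cachelimit": terms.get("cachelimit", 250000000),
        "cutoff": terms.get("cutoff", 0.1),
    }


settings = terms_settings(scoring_config)
```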

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/scoring/terms.py

Signature

class Terms:
    def __init__(self, config, score, idf)
    def insert(self, uid, terms)
    def delete(self, ids)
    def index(self)
    def search(self, terms, limit) -> list
    def escape(self, query) -> str
    def count(self) -> int
    def load(self, path)
    def save(self, path)
    def close(self)
    def initialize(self)
    def connect(self, path="") -> connection
    def copy(self, path) -> connection
    def add(self, indexid, term, freq)
    def lookup(self, term) -> (uids, freqs)
    def expand(self, terms) -> list
    def weights(self, term) -> (uids, weights)
    def topn(self, scores, limit, hasscores, skipped) -> list
    def merge(self, scores, matches, hasscores, terms)
    def candidates(self, scores, topn) -> ndarray

Import

from txtai.scoring.terms import Terms

I/O Contract

Inputs

Name	Type	Required	Description
config	dict	Yes	Configuration dictionary; supports keys cachelimit (int, default 250000000), cutoff (float, default 0.1), and wal (bool) for WAL journal mode.
score	callable	Yes	Scoring function accepting (freq, idf, length) arrays and returning weight arrays (e.g., TF-IDF or BM25 score function).
idf	dict	Yes	Dictionary mapping terms to their IDF weight values.
uid	str/int	Yes (insert)	External document identifier.
terms	list[str]	Yes (insert/search)	List of tokenized terms for a document (insert) or query (search).
limit	int	Yes (search)	Maximum number of results to return.
path	str	Yes (load/save)	File system path for the SQLite terms database.

Outputs

Name	Type	Description
search results	list[tuple(id, float)]	List of (document_id, score) tuples sorted by descending score.
count	int	Number of non-deleted documents in the index.
uids, freqs	tuple(array, array)	Raw term frequency sparse arrays from lookup.
uids, weights	tuple(ndarray, ndarray)	Computed term weight sparse arrays from the weights method.

Usage Examples

from txtai.scoring.terms import Terms
import numpy as np

# Define a simple TF-IDF scoring function
def score_fn(freq, idf, length):
    return idf * np.sqrt(freq) * (1 / np.sqrt(length))

# IDF weights dictionary
idf = {"hello": 1.5, "world": 1.2, "foo": 2.0}

# Create a terms index
config = {"cachelimit": 100000000, "cutoff": 0.1}
terms = Terms(config, score_fn, idf)

# Insert documents
terms.insert("doc1", ["hello", "world", "hello"])
terms.insert("doc2", ["foo", "world"])

# Flush to database
terms.index()

# Search for documents matching query terms
results = terms.search(["hello", "world"], limit=10)
# Returns a list of (id, score) tuples sorted by descending score, with "doc1" ranked above "doc2"

# Save and load
terms.save("/tmp/my_terms.db")
terms.close()

new_terms = Terms(config, score_fn, idf)
new_terms.load("/tmp/my_terms.db")
results = new_terms.search(["foo"], limit=5)
