Implementation:Neuml Txtai Terms Index
| Knowledge Sources | |
|---|---|
| Domains | Information Retrieval, Sparse Indexing, Text Search |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
The Terms class in txtai builds, searches, and stores memory-efficient sparse term frequency arrays backed by SQLite.
Description
The Terms class manages a SQLite-backed sparse term frequency index. It stores term-to-document mappings as compressed binary blobs (arrays of 8-byte signed long longs) in a SQLite database with two tables: a terms table mapping each term to its document IDs and frequencies, and a documents table storing document metadata (index ID, external ID, deleted flag, and token length).
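The blob layout described above can be sketched with the standard library alone. This is a minimal illustration, not the txtai implementation: the table definition and column names are hypothetical, and no compression is applied here.

```python
import sqlite3
from array import array

# Hypothetical schema for illustration only; the real table layout may differ.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE terms (term TEXT PRIMARY KEY, ids BLOB, freqs BLOB)")

# "q" is the array typecode for signed long long (8 bytes per item on typical platforms).
ids = array("q", [0, 3, 7])    # internal document ids that contain the term
freqs = array("q", [2, 1, 4])  # term frequency for each of those documents

# Store both arrays as raw binary blobs keyed by the term.
connection.execute(
    "INSERT INTO terms VALUES (?, ?, ?)",
    ("hello", ids.tobytes(), freqs.tobytes()),
)

# Read the blobs back and rebuild the arrays.
row = connection.execute(
    "SELECT ids, freqs FROM terms WHERE term = ?", ("hello",)
).fetchone()
loaded_ids, loaded_freqs = array("q"), array("q")
loaded_ids.frombytes(row[0])
loaded_freqs.frombytes(row[1])
```

Two parallel arrays keep each posting at a fixed 16 bytes (id plus frequency), which is what makes the index compact compared with storing one row per term and document pair.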
The search algorithm follows a two-phase approach similar to Apache Lucene's common terms query: first, less common terms (appearing in fewer than 10% of documents by default) are scored; then, common term scores are merged only for documents already matching the initial query. This strategy avoids full scans over high-frequency terms.
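The two-phase strategy can be sketched in plain Python as follows. The `postings` layout (term mapped to precomputed per-document weights), the function name, and the fallback used when no rare term matches are assumptions made for this illustration; the real class works on NumPy arrays.

```python
from collections import Counter


def two_phase_search(postings, query, total, limit, cutoff=0.1):
    """postings maps term -> {document index: precomputed weight} (hypothetical layout)."""
    scores, skipped = {}, {}

    # Phase 1: score only terms that appear in at most cutoff * total documents.
    for term, freq in Counter(query).items():
        docs = postings.get(term)
        if not docs:
            continue
        if len(docs) <= cutoff * total:
            for doc, weight in docs.items():
                scores[doc] = scores.get(doc, 0.0) + freq * weight
        else:
            skipped[term] = freq

    # Phase 2: merge common terms, but only into documents that already matched.
    # Assumption: if no rare term matched, common terms are scored for every document.
    hasscores = bool(scores)
    for term, freq in skipped.items():
        for doc, weight in postings[term].items():
            if not hasscores or doc in scores:
                scores[doc] = scores.get(doc, 0.0) + freq * weight

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# "rare" appears in 1 of 20 documents, "the" appears in all 20.
postings = {"rare": {2: 1.0}, "the": {doc: 0.5 for doc in range(20)}}
results = two_phase_search(postings, ["rare", "the"], total=20, limit=10)
# results == [(2, 1.5)]
```

Only document 2 is returned: the common term adds to its score but does not pull in the other 19 documents, which is the point of deferring high-frequency terms.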
Term weights are computed using a configurable scoring function and IDF weights, with an LRU cache (maxsize=500) for frequently accessed term weight arrays. A configurable cache limit (default 250MB) controls when in-memory term data is flushed to the database. The class also supports wildcard term expansion using the asterisk operator and thread-safe database access via an RLock.
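The caching and locking described above might look roughly like this. `WeightCache` is a hypothetical stand-in written for this page, not the txtai class; it only shows how `functools.lru_cache(maxsize=500)` and an `RLock` can be combined, and why the cache has to be cleared when the index changes.

```python
import functools
from threading import RLock


class WeightCache:
    """Hypothetical stand-in showing per-term weight caching; not the txtai class."""

    def __init__(self, postings, idf):
        self.postings, self.idf = postings, idf
        self.lock = RLock()
        self.calls = 0  # counts uncached lookups, for demonstration only

        # Wrap the bound method so each instance gets its own cache.
        self.weights = functools.lru_cache(maxsize=500)(self.weights)

    def weights(self, term):
        # The lock guards the shared lookup, standing in for database access.
        with self.lock:
            self.calls += 1
            ids, freqs = self.postings.get(term, ((), ()))
            return tuple(ids), tuple(self.idf[term] * freq for freq in freqs)

    def index(self, term, ids, freqs):
        # Cached weights are stale once postings change, so drop them.
        with self.lock:
            self.postings[term] = (ids, freqs)
            self.weights.cache_clear()


cache = WeightCache({"hello": ([0, 1], [2, 1])}, {"hello": 1.5})
first = cache.weights("hello")
second = cache.weights("hello")  # served from the cache, no second lookup
```

An `RLock` rather than a plain `Lock` lets a method that already holds the lock call another locked method on the same thread without deadlocking.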
Usage
Use Terms when you need a persistent, memory-efficient sparse term frequency index for keyword-based text search. It is instantiated internally by the TFIDF scoring class when a terms configuration section is present in the scoring config. This is the core data structure behind txtai's TF-IDF, BM25, and SIF scoring search capabilities.
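A scoring configuration that includes a terms section could look like the sketch below. The `cachelimit`, `cutoff`, and `wal` keys are the ones listed on this page; the `method` key and the exact nesting are assumptions and should be checked against the txtai scoring documentation.

```python
# Hypothetical scoring configuration; "method" and the nesting are assumptions.
scoring_config = {
    "method": "bm25",             # assumed key selecting the scoring method
    "terms": {                    # presence of this section enables the terms index
        "cachelimit": 250000000,  # flush in-memory term data at roughly 250MB
        "cutoff": 0.1,            # terms in more than 10% of documents count as common
        "wal": False,             # set to True to use SQLite WAL journal mode
    },
}
```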
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/scoring/terms.py
Signature
class Terms:
    def __init__(self, config, score, idf)
    def insert(self, uid, terms)
    def delete(self, ids)
    def index(self)
    def search(self, terms, limit) -> list
    def escape(self, query) -> str
    def count(self) -> int
    def load(self, path)
    def save(self, path)
    def close(self)
    def initialize(self)
    def connect(self, path="") -> connection
    def copy(self, path) -> connection
    def add(self, indexid, term, freq)
    def lookup(self, term) -> (uids, freqs)
    def expand(self, terms) -> list
    def weights(self, term) -> (uids, weights)
    def topn(self, scores, limit, hasscores, skipped) -> list
    def merge(self, scores, matches, hasscores, terms)
    def candidates(self, scores, topn) -> ndarray
Import
from txtai.scoring.terms import Terms
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | Yes | Configuration dictionary; supports keys cachelimit (int, default 250000000), cutoff (float, default 0.1), and wal (bool) for WAL journal mode. |
| score | callable | Yes | Scoring function accepting (freq, idf, length) arrays and returning weight arrays (e.g., TF-IDF or BM25 score function). |
| idf | dict | Yes | Dictionary mapping terms to their IDF weight values. |
| uid | str/int | Yes (insert) | External document identifier. |
| terms | list[str] | Yes (insert/search) | List of tokenized terms for a document (insert) or query (search). |
| limit | int | Yes (search) | Maximum number of results to return. |
| path | str | Yes (load/save) | File system path for the SQLite terms database. |
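As an illustration of the `score` callable's `(freq, idf, length)` contract, here is a BM25-style function. This is a sketch rather than txtai's own BM25 code: `make_bm25`, `avgdl`, `k1`, and `b` are names and parameters introduced for this example. The arithmetic is purely elementwise, so it should apply equally to NumPy arrays and to plain numbers.

```python
def make_bm25(avgdl, k1=1.2, b=0.75):
    """Builds a BM25-style score callable (illustrative; parameters are assumptions).

    avgdl is the average document length in tokens, k1 controls term
    frequency saturation, and b controls document length normalization.
    """

    def score(freq, idf, length):
        # Elementwise arithmetic only, so array and scalar inputs both work.
        return idf * (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length / avgdl))

    return score


score = make_bm25(avgdl=10.0)
average_doc = score(1, 2.0, 10.0)  # term occurs once in an average-length document
long_doc = score(1, 2.0, 20.0)     # same term in a document twice as long
```

With a single occurrence in an average-length document the normalization cancels and the score equals the IDF weight; the longer document scores lower for the same frequency.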
Outputs
| Name | Type | Description |
|---|---|---|
| search results | list[tuple(id, float)] | List of (document_id, score) tuples sorted by descending score. |
| count | int | Number of non-deleted documents in the index. |
| uids, freqs | tuple(array, array) | Raw term frequency sparse arrays from lookup. |
| uids, weights | tuple(ndarray, ndarray) | Computed term weight sparse arrays from the weights method. |
Usage Examples
from txtai.scoring.terms import Terms
import numpy as np
# Define a simple TF-IDF scoring function
def score_fn(freq, idf, length):
    return idf * np.sqrt(freq) * (1 / np.sqrt(length))
# IDF weights dictionary
idf = {"hello": 1.5, "world": 1.2, "foo": 2.0}
# Create a terms index
config = {"cachelimit": 100000000, "cutoff": 0.1}
terms = Terms(config, score_fn, idf)
# Insert documents
terms.insert("doc1", ["hello", "world", "hello"])
terms.insert("doc2", ["foo", "world"])
# Flush to database
terms.index()
# Search for documents matching query terms
results = terms.search(["hello", "world"], limit=10)
# Returns a list of (document_id, score) tuples sorted by descending score
# Save and load
terms.save("/tmp/my_terms.db")
terms.close()
new_terms = Terms(config, score_fn, idf)
new_terms.load("/tmp/my_terms.db")
results = new_terms.search(["foo"], limit=5)