Implementation:Neuml Txtai TFIDF Scoring

Knowledge Sources	Neuml_Txtai
Domains	Text_Retrieval, Keyword_Search
Last Updated	2026-02-09 17:00 GMT

Overview

TFIDF is a term frequency-inverse document frequency scoring engine that provides keyword-based search and document weighting for txtai's scoring subsystem.

Description

The TFIDF class inherits from Scoring and implements classical TF-IDF scoring with document-length normalization to rank documents by keyword relevance. It maintains corpus-level statistics: word frequency, document frequency, IDF scores, and average document length. The class supports an optional Terms index for full inverted-index keyword search, tag-based boosting, score normalization, content storage, and multithreaded batch search. ScoringFactory can instantiate it automatically from configuration.
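To make the statistics named above concrete, here is a minimal, self-contained sketch of TF-IDF weighting in plain Python. It is not txtai's implementation: the smoothed IDF formula, the square-root term frequency and length normalization, and the names `build_stats` and `score_document` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def build_stats(corpus):
    """Collect IDF values and average document length for tokenized documents."""
    docfreq = Counter()
    for tokens in corpus:
        docfreq.update(set(tokens))  # count each term once per document
    total = len(corpus)
    avgdl = sum(len(tokens) for tokens in corpus) / total
    # Smoothed IDF (assumed formula): rarer terms receive larger values
    idf = {term: math.log((total + 1) / (df + 1)) + 1 for term, df in docfreq.items()}
    return idf, avgdl


def score_document(query_tokens, doc_tokens, idf):
    """Sum length-normalized TF-IDF contributions of query terms found in the document."""
    freq = Counter(doc_tokens)
    length = len(doc_tokens)
    return sum(
        idf[term] * math.sqrt(freq[term]) / math.sqrt(length)
        for term in query_tokens
        if term in freq
    )


corpus = [
    "machine learning algorithms for classification".split(),
    "deep learning neural networks and transformers".split(),
    "keyword search using term frequency scoring".split(),
]
idf, avgdl = build_stats(corpus)
scores = [score_document(["neural", "learning"], doc, idf) for doc in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
```

In this toy corpus, "neural" appears in one document and "learning" in two, so "neural" carries the larger IDF and the second document ranks first for the query.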

Usage

Use TFIDF when you need keyword-based document retrieval or token weighting as part of a hybrid search strategy. It is commonly paired with vector-based embeddings search to combine semantic and keyword relevance. It is also used standalone for lightweight text search without neural embeddings.
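The hybrid pairing described above amounts to blending two ranked score lists. The following is a minimal sketch, assuming both score sets are already normalized to the range [0, 1] and combined with a convex weighting. The `hybrid_scores` helper and its `weight` parameter are hypothetical and are not part of the TFIDF class.

```python
def hybrid_scores(keyword, vector, weight=0.5):
    """
    Blend keyword and vector scores per document id.

    keyword, vector: lists of (id, score) tuples with scores in [0, 1]
    weight: share given to vector scores; 1 - weight goes to keyword scores
    """
    combined = {}
    for uid, score in keyword:
        combined[uid] = combined.get(uid, 0.0) + (1 - weight) * score
    for uid, score in vector:
        combined[uid] = combined.get(uid, 0.0) + weight * score
    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


keyword_results = [(0, 0.9), (3, 0.4)]
vector_results = [(1, 0.8), (0, 0.5)]
ranked = hybrid_scores(keyword_results, vector_results, weight=0.5)
```

A document that appears in only one list still receives a score, so a strong keyword match is not discarded just because the vector search missed it.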

Code Reference

Source Location

Repository: Neuml_Txtai
File: src/python/txtai/scoring/tfidf.py
Lines: 1-361

Signature

class TFIDF(Scoring):
    def __init__(self, config=None):
        """
        Creates a new TFIDF scoring instance.

        Args:
            config: scoring configuration dict
        """

Import

from txtai.scoring import TFIDF

Key Methods

Method	Description
`insert(documents, index=None, checkpoint=None)`	Inserts documents into the scoring index. Each document is tokenized and its statistics (word frequency, document frequency, tags) are accumulated.
`delete(ids)`	Deletes documents by id from the terms index and content store.
`index(documents=None)`	Finalizes the index by computing IDF scores, average document length, average token frequency, and average IDF score, then builds the terms index when one is configured.
`weights(tokens)`	Computes a TF-IDF weight for each token in a token list. Applies tag boosting when tag tokens are present.
`search(query, limit=3)`	Searches the terms index for documents matching the query. Returns results sorted by score, optionally normalized.
`batchsearch(queries, limit=3, threads=True)`	Runs multiple search queries in parallel using a thread pool. Thread count is auto-scaled based on index size.
`count()`	Returns the total number of documents in the index.
`load(path)`	Loads a previously saved scoring index from disk using the Serializer.
`save(path)`	Persists the scoring index to disk using the Serializer.
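The parallel pattern behind `batchsearch` can be sketched with the standard library. This is an illustration under stated assumptions, not the library's code: the `batch_search` helper, the fixed `workers` count, and the `toy_search` stand-in are hypothetical, and the auto-scaling of thread count mentioned in the table is not reproduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_search(search, queries, limit=3, workers=4):
    """Run search(query, limit) for each query on a thread pool, preserving query order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda query: search(query, limit), queries))


def toy_search(query, limit):
    """Stand-in for a real search method: ranks a fixed corpus by shared-token count."""
    corpus = {0: "topic a", 1: "topic b", 2: "other b"}
    terms = set(query.split())
    hits = [(uid, len(terms & set(text.split()))) for uid, text in corpus.items()]
    hits = [hit for hit in hits if hit[1] > 0]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)[:limit]


results = batch_search(toy_search, ["topic a", "b"], limit=2)
```

`pool.map` returns results in the same order as the input queries, which is why the caller can pair each query with its result list by position.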

I/O Contract

Inputs

Name	Type	Required	Description
config	dict	No	Configuration dictionary. Key options include `terms` (enables inverted index), `content` (enables document storage), `normalize` (enables score normalization), and `tokenizer` (custom tokenizer settings).
documents	iterable	Yes (for insert)	Iterable of `(uid, document, tags)` tuples. The document can be a string, a list of tokens, or a dict with a `text` or `object` key.
query	str	Yes (for search)	Search query string. Tokenized internally using the configured tokenizer.
limit	int	No	Maximum number of results to return. Defaults to 3.

Outputs

Name	Type	Description
search results	list of tuple or list of dict	List of `(id, score)` tuples when content storage is disabled. List of `{"id": ..., "text": ..., "score": ...}` dicts when content storage is enabled.
weights	list of float	Per-token TF-IDF weight scores from the `weights()` method.
count	int	Total document count from `count()`.
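Because `search` returns either tuples or dicts depending on whether content storage is enabled, calling code may want to normalize both shapes before further processing. A minimal sketch follows; the `to_pairs` helper is hypothetical and relies only on the two result shapes listed in the table above.

```python
def to_pairs(results):
    """Convert search results to (id, score) tuples, whether rows are dicts or tuples."""
    pairs = []
    for row in results:
        if isinstance(row, dict):
            # Content storage enabled: rows look like {"id": ..., "text": ..., "score": ...}
            pairs.append((row["id"], row["score"]))
        else:
            # Content storage disabled: rows are (id, score) tuples
            uid, score = row
            pairs.append((uid, score))
    return pairs


tuple_rows = [(1, 0.5), (4, 0.25)]
dict_rows = [{"id": 2, "text": "keyword search", "score": 0.3}]
```

Calling `to_pairs(tuple_rows)` leaves the tuples unchanged, while `to_pairs(dict_rows)` drops the `text` field and keeps only the id and score.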

Usage Examples

Basic Usage

from txtai.scoring import ScoringFactory

# Create a TF-IDF scoring instance with a terms index
scoring = ScoringFactory.create({"method": "tfidf", "terms": True, "content": True})

# Insert documents
documents = [
    (0, "machine learning algorithms for classification", None),
    (1, "deep learning neural networks and transformers", None),
    (2, "natural language processing with text embeddings", None),
    (3, "keyword search using term frequency scoring", None),
]

scoring.insert(documents)
scoring.index()

# Search for relevant documents
results = scoring.search("neural network classification", limit=3)
for result in results:
    print(f"ID: {result['id']}, Score: {result['score']:.4f}, Text: {result['text']}")

Token Weighting

from txtai.scoring import ScoringFactory

scoring = ScoringFactory.create({"method": "tfidf", "terms": True})

documents = [
    (0, "the quick brown fox jumps over the lazy dog", None),
    (1, "a fast red fox leaps across the sleeping hound", None),
]

scoring.insert(documents)
scoring.index()

# Get TF-IDF weights for tokens
tokens = ["quick", "brown", "fox"]
weights = scoring.weights(tokens)
for token, weight in zip(tokens, weights):
    print(f"Token: {token}, Weight: {weight:.4f}")

Batch Search

from txtai.scoring import ScoringFactory

scoring = ScoringFactory.create({"method": "tfidf", "terms": True})

# Insert and index documents
scoring.insert([(i, f"Document {i} about topic {i % 3}", None) for i in range(100)])
scoring.index()

# Run multiple queries in parallel
queries = ["topic 0", "topic 1", "topic 2"]
results = scoring.batchsearch(queries, limit=5)
for query, result in zip(queries, results):
    print(f"Query: {query}, Results: {len(result)}")

Related Pages

Principle:Neuml_Txtai_Keyword_Scoring
