Implementation:Neuml Txtai TFIDF Scoring
| Knowledge Sources | |
|---|---|
| Domains | Text_Retrieval, Keyword_Search |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
TFIDF is a term frequency-inverse document frequency scoring engine that provides keyword-based search and document weighting for txtai's scoring subsystem.
Description
The TFIDF class inherits from Scoring and implements classical TF-IDF scoring with BM25-style normalization for ranking documents by keyword relevance. It maintains document-level statistics including word frequency, document frequency, IDF scores, and average document length. The class supports an optional Terms index for full inverted-index keyword search, tag-based boosting, score normalization, content storage, and multithreaded batch search. It integrates with ScoringFactory for automatic instantiation based on configuration.
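To make the statistics named above concrete, here is a minimal pure-Python sketch of how word frequency, document frequency, IDF, and average document length can be derived from tokenized documents. It illustrates the general technique only and is not txtai's implementation: the helper name `build_stats` and the smoothed IDF formula `log((N + 1) / (df + 1)) + 1` are assumptions chosen for this example.

```python
import math
from collections import Counter


def build_stats(documents):
    """Hypothetical helper: collect corpus statistics from tokenized documents."""
    total = len(documents)
    wordfreq = Counter()  # total occurrences of each token across the corpus
    docfreq = Counter()   # number of documents that contain each token
    tokens = 0

    for doc in documents:
        wordfreq.update(doc)
        docfreq.update(set(doc))  # count each token once per document
        tokens += len(doc)

    # Assumed smoothed IDF: rarer tokens receive larger values
    idf = {t: math.log((total + 1) / (df + 1)) + 1 for t, df in docfreq.items()}

    # Average document length in tokens
    avgdl = tokens / total if total else 0.0

    return {"total": total, "wordfreq": wordfreq, "docfreq": docfreq, "idf": idf, "avgdl": avgdl}


stats = build_stats([
    ["keyword", "search", "scoring"],
    ["vector", "search"],
])
```

In this toy corpus, "search" appears in both documents, so it ends up with a lower IDF than "keyword" or "vector", which each appear in only one.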
Usage
Use TFIDF when you need keyword-based document retrieval or token weighting as part of a hybrid search strategy. It is commonly paired with vector-based embeddings search to combine semantic and keyword relevance. It is also used standalone for lightweight text search without neural embeddings.
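As a rough illustration of the hybrid strategy described above, the sketch below blends keyword scores with vector scores using a single weight. The helper `hybrid_scores`, its `weight` parameter, and the assumption that both score sets are already on a comparable 0 to 1 scale are all hypothetical; this is not how txtai necessarily combines results internally.

```python
def hybrid_scores(keyword, vector, weight=0.5):
    """
    Hypothetical helper: blend (id, score) results from a keyword index and a
    vector index. `weight` is the share given to vector scores, the remainder
    goes to keyword scores. Assumes both inputs use a comparable 0-1 scale.
    """
    combined = {}

    for uid, score in keyword:
        combined[uid] = combined.get(uid, 0.0) + (1 - weight) * score

    for uid, score in vector:
        combined[uid] = combined.get(uid, 0.0) + weight * score

    # Highest combined score first
    return sorted(combined.items(), key=lambda x: x[1], reverse=True)


ranked = hybrid_scores(
    keyword=[(0, 0.8), (1, 0.2)],
    vector=[(1, 0.9), (2, 0.4)],
    weight=0.5,
)
```

A document that appears in both result sets accumulates both contributions, which is why id 1 ranks first here despite its low keyword score.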
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/scoring/tfidf.py
- Lines: 1-361
Signature
class TFIDF(Scoring):
    def __init__(self, config=None):
        """
        Creates a new TFIDF scoring instance.

        Args:
            config: scoring configuration dict
        """
Import
from txtai.scoring import TFIDF
Key Methods
| Method | Description |
|---|---|
| insert(documents, index=None, checkpoint=None) | Inserts documents into the scoring index. Each document is tokenized and its statistics (word frequency, document frequency, tags) are accumulated. |
| delete(ids) | Deletes documents by id from the terms index and content store. |
| index(documents=None) | Finalizes the index by computing IDF scores, average document length, average frequency, average score, and building the terms index. |
| weights(tokens) | Computes a TF-IDF weight for each token in a token list. Applies tag boosting when tag tokens are present. |
| search(query, limit=3) | Searches the terms index for documents matching the query. Returns results sorted by score, optionally normalized. |
| batchsearch(queries, limit=3, threads=True) | Runs multiple search queries in parallel using a thread pool. Thread count is auto-scaled based on index size. |
| count() | Returns the total number of documents in the index. |
| load(path) | Loads a previously saved scoring index from disk using the Serializer. |
| save(path) | Persists the scoring index to disk using the Serializer. |
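The parallel pattern behind batchsearch can be sketched with the standard library alone. In the example below, `batch_search`, `toy_search`, and the fixed `workers` count are hypothetical stand-ins used for illustration; the real method chooses its own thread count and calls the class's own search.

```python
from concurrent.futures import ThreadPoolExecutor


def batch_search(search, queries, limit=3, workers=4):
    """
    Hypothetical helper: run search(query, limit) for every query in a thread
    pool. pool.map preserves input order, so results line up with queries.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda query: search(query, limit), queries))


# Stand-in for a real keyword index: query term -> pre-ranked (id, score) results
toy_index = {"fox": [(0, 0.9), (1, 0.7)], "dog": [(0, 0.5)]}


def toy_search(query, limit):
    """Stand-in search function that returns at most `limit` results."""
    return toy_index.get(query, [])[:limit]


results = batch_search(toy_search, ["fox", "dog", "cat"], limit=1)
```

Each entry in `results` corresponds to the query at the same position, and a query with no matches yields an empty list.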
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | No | Configuration dictionary. Key options include terms (enables inverted index), content (enables document storage), normalize (enables score normalization), and tokenizer (custom tokenizer settings). |
| documents | iterable | Yes (for insert) | Iterable of (uid, document, tags) tuples. Document can be a string, list of tokens, or dict with text/object keys. |
| query | str | Yes (for search) | Search query string. Tokenized internally using the configured tokenizer. |
| limit | int | No | Maximum number of results to return. Defaults to 3. |
Outputs
| Name | Type | Description |
|---|---|---|
| search results | list of tuple or list of dict | List of (id, score) tuples when content storage is disabled. List of {"id": ..., "text": ..., "score": ...} dicts when content storage is enabled. |
| weights | list of float | Per-token TF-IDF weight scores from the weights() method. |
| count | int | Total document count from count(). |
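Because search results come back in one of two shapes depending on whether content storage is enabled, calling code may want to handle both uniformly. The helper below, `iter_results`, is a hypothetical convenience written for this page, not part of txtai.

```python
def iter_results(results):
    """
    Hypothetical helper: yield (id, score, text) for either result shape.
    Tuples carry no text, so text is None in that case.
    """
    for result in results:
        if isinstance(result, dict):
            # Content storage enabled: {"id": ..., "text": ..., "score": ...}
            yield result["id"], result["score"], result.get("text")
        else:
            # Content storage disabled: (id, score)
            uid, score = result
            yield uid, score, None


rows = list(iter_results([(3, 0.42)]))
rows += list(iter_results([{"id": 1, "text": "deep learning", "score": 0.9}]))
```

Downstream code can then loop over `rows` without checking which configuration produced them.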
Usage Examples
Basic Usage
from txtai.scoring import ScoringFactory
# Create a TF-IDF scoring instance with a terms index
scoring = ScoringFactory.create({"method": "tfidf", "terms": True, "content": True})
# Insert documents
documents = [
    (0, "machine learning algorithms for classification", None),
    (1, "deep learning neural networks and transformers", None),
    (2, "natural language processing with text embeddings", None),
    (3, "keyword search using term frequency scoring", None),
]
scoring.insert(documents)
scoring.index()
# Search for relevant documents
results = scoring.search("neural network classification", limit=3)
for result in results:
    print(f"ID: {result['id']}, Score: {result['score']:.4f}, Text: {result['text']}")
Token Weighting
from txtai.scoring import ScoringFactory
scoring = ScoringFactory.create({"method": "tfidf", "terms": True})
documents = [
    (0, "the quick brown fox jumps over the lazy dog", None),
    (1, "a fast red fox leaps across the sleeping hound", None),
]
scoring.insert(documents)
scoring.index()
# Get TF-IDF weights for tokens
tokens = ["quick", "brown", "fox"]
weights = scoring.weights(tokens)
for token, weight in zip(tokens, weights):
    print(f"Token: {token}, Weight: {weight:.4f}")
Batch Search
from txtai.scoring import ScoringFactory
scoring = ScoringFactory.create({"method": "tfidf", "terms": True})
# Insert and index documents
scoring.insert([(i, f"Document {i} about topic {i % 3}", None) for i in range(100)])
scoring.index()
# Run multiple queries in parallel
queries = ["topic 0", "topic 1", "topic 2"]
results = scoring.batchsearch(queries, limit=5)
for query, result in zip(queries, results):
    print(f"Query: {query}, Results: {len(result)}")