Implementation:Neuml Txtai TFIDF Scoring

From Leeroopedia


Knowledge Sources
Domains: Text_Retrieval, Keyword_Search
Last Updated: 2026-02-09 17:00 GMT

Overview

TFIDF is a term frequency-inverse document frequency scoring engine that provides keyword-based search and document weighting for txtai's scoring subsystem.

Description

The TFIDF class inherits from Scoring and implements classical TF-IDF scoring, normalized by document length, for ranking documents by keyword relevance. It maintains corpus-level statistics: word frequency, document frequency, IDF scores, and average document length. Optional features, enabled through configuration, are a Terms index for full inverted-index keyword search, tag-based boosting, score normalization, content storage, and multithreaded batch search. ScoringFactory instantiates the class automatically when the configuration selects the tfidf method.
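The statistics named in this description can be illustrated with a small standalone sketch. It is not txtai's code: the smoothed logarithmic IDF and the square-root length normalization are assumptions chosen for illustration, and the names `build_stats` and `score` are hypothetical.

```python
import math
from collections import Counter


def build_stats(docs):
    """Accumulate corpus statistics from a list of token lists."""
    wordfreq = Counter()  # total occurrences of each token
    docfreq = Counter()   # number of documents containing each token
    for tokens in docs:
        wordfreq.update(tokens)
        docfreq.update(set(tokens))

    total = len(docs)
    avgdl = sum(wordfreq.values()) / total  # average document length in tokens

    # Assumed smoothed IDF: rarer tokens receive higher values
    idf = {t: math.log((total + 1) / (n + 1)) + 1 for t, n in docfreq.items()}
    return wordfreq, docfreq, idf, avgdl


def score(freq, idf, length):
    """Assumed score: IDF times damped term frequency, normalized by document length."""
    return idf * math.sqrt(freq) / math.sqrt(length)


docs = [
    ["keyword", "search", "with", "term", "frequency"],
    ["vector", "search", "with", "embeddings"],
]
wordfreq, docfreq, idf, avgdl = build_stats(docs)
print(docfreq["search"], docfreq["keyword"], avgdl)
```

A token that appears in every document ("search") ends up with the lowest IDF, while a token found in a single document ("keyword") is weighted higher.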

Usage

Use TFIDF when you need keyword-based document retrieval or token weighting as part of a hybrid search strategy. It is commonly paired with vector-based embeddings search to combine semantic and keyword relevance. It is also used standalone for lightweight text search without neural embeddings.
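One way to picture the hybrid pairing is a weighted blend of the two score sets. The sketch below is a hypothetical helper, not txtai's implementation: `hybrid_scores` and its `weight` parameter are illustrative names, and it assumes both inputs are already on a comparable 0 to 1 scale (which is where score normalization becomes relevant).

```python
def hybrid_scores(keyword, vector, weight=0.5):
    """
    Blend keyword and vector scores per document id.

    keyword, vector: dicts mapping id -> score, assumed to be on a comparable scale
    weight: share given to the vector score; the remainder goes to the keyword score
    """
    combined = {}
    for uid in set(keyword) | set(vector):
        combined[uid] = weight * vector.get(uid, 0.0) + (1 - weight) * keyword.get(uid, 0.0)

    # Highest combined score first
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


keyword = {0: 0.9, 1: 0.2}
vector = {1: 0.8, 2: 0.6}
ranked = hybrid_scores(keyword, vector, weight=0.5)
print(ranked)
```

A document matched by both methods (id 1 here) can outrank one that scores highly on keywords alone.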

Code Reference

Source Location

Signature

class TFIDF(Scoring):
    def __init__(self, config=None):
        """
        Creates a new TFIDF scoring instance.

        Args:
            config: scoring configuration dict
        """

Import

from txtai.scoring import TFIDF

Key Methods

Method | Description
insert(documents, index=None, checkpoint=None) | Inserts documents into the scoring index. Each document is tokenized and its statistics (word frequency, document frequency, tags) are accumulated.
delete(ids) | Deletes documents by id from the terms index and content store.
index(documents=None) | Finalizes the index by computing IDF scores, average document length, average frequency, average score, and building the terms index.
weights(tokens) | Computes a TF-IDF weight for each token in a token list. Applies tag boosting when tag tokens are present.
search(query, limit=3) | Searches the terms index for documents matching the query. Returns results sorted by score, optionally normalized.
batchsearch(queries, limit=3, threads=True) | Runs multiple search queries in parallel using a thread pool. Thread count is auto-scaled based on index size.
count() | Returns the total number of documents in the index.
load(path) | Loads a previously saved scoring index from disk using the Serializer.
save(path) | Persists the scoring index to disk using the Serializer.
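The parallel behavior described for batchsearch can be sketched with the standard library alone. Everything here is illustrative: the sizing rule in `thread_count` (one thread per 32 documents, capped at the CPU count) is an assumption, and `toy_search` is a stand-in for a real search method.

```python
import math
import os
from multiprocessing.pool import ThreadPool


def thread_count(size, batch=32):
    """Hypothetical sizing rule: one thread per `batch` documents, between 1 and the CPU count."""
    return min(max(math.ceil(size / batch), 1), os.cpu_count() or 1)


def batch_search(search, queries, limit=3, size=0):
    """Run search(query, limit) for every query on a thread pool, preserving query order."""
    with ThreadPool(thread_count(size)) as pool:
        return pool.starmap(search, [(query, limit) for query in queries])


def toy_search(query, limit):
    """Stand-in for a real search method: returns up to `limit` (id, score) tuples."""
    return [(x, 1.0 / (x + 1)) for x in range(len(query))][:limit]


results = batch_search(toy_search, ["topic 0", "ab"], limit=3, size=100)
print([len(result) for result in results])
```

Because `starmap` returns results in input order, each result list lines up with its query even though the searches run concurrently.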

I/O Contract

Inputs

Name | Type | Required | Description
config | dict | No | Configuration dictionary. Key options include terms (enables inverted index), content (enables document storage), normalize (enables score normalization), and tokenizer (custom tokenizer settings).
documents | iterable | Yes (for insert) | Iterable of (uid, document, tags) tuples. Document can be a string, list of tokens, or dict with text/object keys.
query | str | Yes (for search) | Search query string. Tokenized internally using the configured tokenizer.
limit | int | No | Maximum number of results to return. Defaults to 3.
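The three accepted document forms can be pictured with a small helper that reduces each one to a token list. This is a sketch under stated assumptions: `to_tokens` is a hypothetical name, the key names "text" and "object" follow the table above, and lowercase whitespace splitting stands in for whatever tokenizer is actually configured.

```python
def to_tokens(document, text="text", obj="object"):
    """
    Reduce one of the accepted document forms to a token list.

    Returns None when the document carries no usable text.
    """
    # Dicts hold the content under a text key, falling back to an object key
    if isinstance(document, dict):
        document = document.get(text, document.get(obj))

    if isinstance(document, str):
        # Assumption: lowercase whitespace splitting stands in for the configured tokenizer
        return document.lower().split()

    if isinstance(document, list):
        # Already tokenized
        return document

    return None


documents = [
    (0, "Keyword search", None),
    (1, ["term", "frequency"], None),
    (2, {"text": "Inverse document frequency"}, None),
]
for uid, document, tags in documents:
    print(uid, to_tokens(document))
```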

Outputs

Name | Type | Description
search results | list of tuple or list of dict | List of (id, score) tuples when content storage is disabled. List of {"id": ..., "text": ..., "score": ...} dicts when content storage is enabled.
weights | list of float | Per-token TF-IDF weight scores from the weights() method.
count | int | Total document count from count().
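Since search results come in two shapes depending on whether content storage is enabled, calling code may want to handle both uniformly. The helper below is a hypothetical convenience, not part of txtai; `to_rows` is an illustrative name, and treating a missing result as empty is an assumption.

```python
def to_rows(results):
    """Convert search results in either documented shape to a list of (id, score, text) tuples."""
    rows = []
    for result in results or []:
        if isinstance(result, dict):
            # Content storage enabled: {"id": ..., "text": ..., "score": ...}
            rows.append((result["id"], result["score"], result.get("text")))
        else:
            # Content storage disabled: (id, score)
            uid, score = result
            rows.append((uid, score, None))
    return rows


print(to_rows([(0, 0.75), (2, 0.5)]))
print(to_rows([{"id": 1, "text": "deep learning", "score": 0.9}]))
```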

Usage Examples

Basic Usage

from txtai.scoring import ScoringFactory

# Create a TF-IDF scoring instance with a terms index
scoring = ScoringFactory.create({"method": "tfidf", "terms": True, "content": True})

# Insert documents
documents = [
    (0, "machine learning algorithms for classification", None),
    (1, "deep learning neural networks and transformers", None),
    (2, "natural language processing with text embeddings", None),
    (3, "keyword search using term frequency scoring", None),
]

scoring.insert(documents)
scoring.index()

# Search for relevant documents
results = scoring.search("neural network classification", limit=3)
for result in results:
    print(f"ID: {result['id']}, Score: {result['score']:.4f}, Text: {result['text']}")

Token Weighting

from txtai.scoring import ScoringFactory

scoring = ScoringFactory.create({"method": "tfidf", "terms": True})

documents = [
    (0, "the quick brown fox jumps over the lazy dog", None),
    (1, "a fast red fox leaps across the sleeping hound", None),
]

scoring.insert(documents)
scoring.index()

# Get TF-IDF weights for tokens
tokens = ["quick", "brown", "fox"]
weights = scoring.weights(tokens)
for token, weight in zip(tokens, weights):
    print(f"Token: {token}, Weight: {weight:.4f}")

Batch Search

from txtai.scoring import ScoringFactory

scoring = ScoringFactory.create({"method": "tfidf", "terms": True})

# Insert and index documents
scoring.insert([(i, f"Document {i} about topic {i % 3}", None) for i in range(100)])
scoring.index()

# Run multiple queries in parallel
queries = ["topic 0", "topic 1", "topic 2"]
results = scoring.batchsearch(queries, limit=5)
for query, result in zip(queries, results):
    print(f"Query: {query}, Results: {len(result)}")
