Implementation:Neuml Txtai SIF Scoring
| Knowledge Sources | |
|---|---|
| Domains | Information Retrieval, Scoring, Word Embeddings |
| Last Updated | 2026-02-10 01:00 GMT |
Overview
Concrete tool for Smooth Inverse Frequency (SIF) scoring provided by txtai.
Description
The SIF class extends TFIDF to implement the Smooth Inverse Frequency weighting scheme from the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" (Arora et al., 2017). SIF weights are primarily used for building weighted-average word embeddings that produce high-quality sentence representations.
SIF overrides two key methods from TFIDF:
- computefreq: Instead of computing per-document token counts, SIF uses corpus-wide word frequencies from the entire index (self.wordfreq). This global frequency perspective is what distinguishes SIF from standard TF-IDF.
- score: Computes the SIF weight as a / (a + freq / total_tokens), where a is a smoothing parameter (default 1e-3). This formula assigns higher weights to rare words and lower weights to common words. When the freq and idf shapes do not match (i.e., during term index scoring), the frequency array is filled with its sum for shape compatibility.
All other functionality including document insertion, terms index management, batch search, and serialization is inherited from TFIDF.
Usage
Use SIF when you need word embedding weights that emphasize rare, informative words over common ones. SIF scoring is particularly effective when combined with word vector models (e.g., WordVectors) for building sentence embeddings. It provides a simple yet competitive baseline for sentence-level representations.
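The weighted-average idea can be sketched with NumPy as follows. The vectors and weights are made-up toy values, and sif_sentence_embedding is a hypothetical helper rather than a txtai function. The sketch normalizes by the sum of the weights (np.average), which differs from dividing by sentence length only by a per-sentence scale factor, and it omits the principal-component removal step that Arora et al. apply afterwards.

```python
import numpy as np


def sif_sentence_embedding(tokens, vectors, weights):
    """Weighted average of word vectors for one sentence.

    Hypothetical helper for illustration. Tokens missing from either lookup
    are skipped.
    """
    known = [t for t in tokens if t in vectors and t in weights]
    if not known:
        raise ValueError("no tokens with both a vector and a weight")

    matrix = np.array([vectors[t] for t in known], dtype=float)
    w = np.array([weights[t] for t in known], dtype=float)

    # np.average divides by the sum of the weights
    return np.average(matrix, axis=0, weights=w)


# Toy 2-dimensional word vectors and SIF-style weights (illustrative values)
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
weights = {"the": 0.005, "cat": 0.015}

embedding = sif_sentence_embedding(["the", "cat", "unknown"], vectors, weights)
print(embedding)
```

Because "cat" carries three times the weight of "the", the resulting vector sits three quarters of the way toward the "cat" vector.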
Code Reference
Source Location
- Repository: Neuml_Txtai
- File: src/python/txtai/scoring/sif.py
Signature
class SIF(TFIDF):
    def __init__(self, config=None)
    def computefreq(self, tokens) -> dict
    def score(self, freq, idf, length) -> ndarray
Import
from txtai.scoring import SIF
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | dict | No | Configuration dictionary. Inherits all TFIDF config keys. Additionally supports: a (float, default 1e-3), the SIF smoothing parameter controlling the weight curve. |
| tokens | list[str] | Yes (computefreq) | List of tokens to look up corpus-wide frequencies for. |
| freq | ndarray | Yes (score) | Token frequency array (corpus-wide frequencies). |
| idf | float or ndarray | Yes (score) | IDF score(s) for the term(s). Used only for shape matching. |
| length | int or ndarray | Yes (score) | Document length(s). Not used in SIF score calculation. |
Outputs
| Name | Type | Description |
|---|---|---|
| freq | dict | Dictionary mapping tokens to their corpus-wide word frequencies. |
| scores | ndarray | SIF weights computed as a / (a + freq / total_tokens). |
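The contract for score, including the shape handling noted in the Description, can be illustrated with a standalone function. This is one reading of the behavior described on this page, not the library's code: sif_score is a hypothetical name, and the smoothing parameter and token total are passed as arguments here, whereas the class method has the signature score(freq, idf, length).

```python
import numpy as np


def sif_score(freq, idf, total_tokens, a=1e-3):
    """Sketch of the described scoring step (hypothetical, not txtai's code).

    idf is used only to compare shapes; it does not enter the formula.
    """
    # If freq is an array whose shape differs from idf, replace every entry
    # with the array's sum so the result has a consistent shape
    if isinstance(freq, np.ndarray) and freq.shape != np.shape(idf):
        freq = np.full(freq.shape, freq.sum())

    return a / (a + freq / total_tokens)


# Shapes match: each token is scored from its own corpus-wide frequency
matched = sif_score(np.array([3, 1]), np.array([0.5, 1.2]), total_tokens=15)

# Shapes differ (scalar idf): every entry is scored from the summed frequency
filled = sif_score(np.array([3, 1]), 1.0, total_tokens=15)

print(matched, filled)
```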
Usage Examples
from txtai.scoring import SIF
# Create SIF scoring with a custom smoothing parameter
# and a terms index, which search() relies on
scoring = SIF({
    "a": 1e-3,
    "terms": True
})
# Insert documents
documents = [
    (0, "the cat sat on the mat", None),
    (1, "dogs are loyal companions", None),
    (2, "the quick brown fox jumps", None),
]
scoring.insert(documents)
scoring.index()
# Get SIF weights for token weighting in word embeddings
weights = scoring.weights(["cat", "sat", "the", "mat"])
# Common words like "the" get lower weights; rare words like "cat" get higher weights
# Use with search
results = scoring.search("cat companions", limit=5)
# Save the index to disk, then release resources
scoring.save("/tmp/sif_scoring")
scoring.close()