
Implementation:Neuml Txtai SIF Scoring

From Leeroopedia


Knowledge Sources
Domains: Information Retrieval, Scoring, Word Embeddings
Last Updated: 2026-02-10 01:00 GMT

Overview

Concrete tool for Smooth Inverse Frequency (SIF) scoring provided by txtai.

Description

The SIF class extends TFIDF to implement the Smooth Inverse Frequency weighting scheme from the paper "A Simple but Tough-to-Beat Baseline for Sentence Embeddings" (Arora et al., 2017). SIF weights are primarily used for building weighted-average word embeddings that produce high-quality sentence representations.

SIF overrides two key methods from TFIDF:

  • computefreq: Instead of computing per-document token counts, SIF uses corpus-wide word frequencies from the entire index (self.wordfreq). This global frequency perspective is what distinguishes SIF from standard TF-IDF.
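  • score: Computes the SIF weight as a / (a + freq / total_tokens), where a is a smoothing parameter (default 1e-3), freq is a token's corpus-wide frequency, and total_tokens is the total number of tokens in the index. Because freq / total_tokens grows with how common a word is, rare words receive weights close to 1 and common words receive weights close to 0. When the freq and idf shapes do not match (i.e., during term index scoring), every entry of the frequency array is overwritten with the array's sum for shape compatibility.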

All other functionality including document insertion, terms index management, batch search, and serialization is inherited from TFIDF.
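
The score formula described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not txtai's source: the function name sif_score and the total_tokens argument are hypothetical, and the shape check only mirrors the fill-with-sum behaviour described for term index scoring.

```python
import numpy as np

def sif_score(freq, idf, total_tokens, a=1e-3):
    """Illustrative SIF weight: a / (a + freq / total_tokens)."""
    # Work on a float copy so the caller's data is left untouched (an assumption of this sketch)
    freq = np.array(freq, dtype=float)

    # If the freq and idf shapes differ, overwrite every entry with the sum of freq
    if freq.shape != np.shape(idf):
        freq.fill(freq.sum())

    return a / (a + freq / total_tokens)

# Corpus-wide counts for three tokens in a 15-token corpus, shapes match
weights = sif_score([3, 1, 1], idf=np.ones(3), total_tokens=15)

# Scalar idf: shapes differ, so each entry of freq becomes 3 + 1 + 1 = 5
filled = sif_score([3, 1, 1], idf=1.0, total_tokens=15)
print(weights, filled)
```

In the first call the token with count 3 receives a smaller weight than the tokens with count 1, which is the rare-over-common behaviour the formula is meant to produce.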

Usage

Use SIF when you need word embedding weights that emphasize rare, informative words over common ones. SIF scoring is particularly effective when combined with word vector models (e.g., WordVectors) for building sentence embeddings. It provides a simple yet competitive baseline for sentence-level representations.
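
To make the weighted-average idea concrete, the sketch below combines per-token weights with toy word vectors. Everything here is hypothetical: the two-dimensional vectors, the weight values, and the function name weighted_sentence_embedding are made up for illustration and do not come from txtai. It covers only the weighted average; the common-component removal step from Arora et al. is omitted.

```python
import numpy as np

def weighted_sentence_embedding(tokens, vectors, weights):
    """Weighted average of word vectors, one weight per token."""
    rows = np.array([vectors[token] for token in tokens], dtype=float)
    w = np.array([weights[token] for token in tokens], dtype=float)
    # np.average normalizes by the sum of the weights
    return np.average(rows, axis=0, weights=w)

# Toy 2-d vectors and SIF-style weights (hypothetical values)
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
weights = {"the": 0.1, "cat": 0.9}

embedding = weighted_sentence_embedding(["the", "cat"], vectors, weights)
print(embedding)
```

With these values the result sits much closer to the vector for "cat" than to the vector for "the", because the rarer word carries the larger weight.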

Code Reference

Source Location

  • Repository: Neuml_Txtai
  • File: src/python/txtai/scoring/sif.py

Signature

class SIF(TFIDF):
    def __init__(self, config=None)
    def computefreq(self, tokens) -> dict
    def score(self, freq, idf, length) -> ndarray

Import

from txtai.scoring import SIF

I/O Contract

Inputs

Name | Type | Required | Description
config | dict | No | Configuration dictionary. Inherits all TFIDF config keys. Additionally supports: a (float, default 1e-3), the SIF smoothing parameter controlling the weight curve.
tokens | list[str] | Yes (computefreq) | List of tokens to look up corpus-wide frequencies for.
freq | ndarray | Yes (score) | Token frequency array (corpus-wide frequencies).
idf | float or ndarray | Yes (score) | IDF score(s) for the term(s). Used only for shape matching.
length | int or ndarray | Yes (score) | Document length(s). Not used in the SIF score calculation.
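
How the smoothing parameter a shapes the weight curve can be seen by evaluating the same formula for a few values. This is a small sketch, assuming p stands for freq / total_tokens; the helper name sif_weight and the chosen probabilities are illustrative only.

```python
import numpy as np

def sif_weight(p, a):
    """SIF weight for a relative word frequency p = freq / total_tokens."""
    return a / (a + p)

# Relative frequencies for a rare, a moderately common and a very common word
probabilities = np.array([1e-5, 1e-3, 1e-1])

# Larger values of a flatten the curve; smaller values penalize common words harder
curves = {a: sif_weight(probabilities, a) for a in (1e-4, 1e-3, 1e-2)}
for a, curve in curves.items():
    print(a, curve)
```

For every fixed frequency the weight increases with a, so raising a pulls the weights of common words back toward 1.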

Outputs

Name | Type | Description
freq | dict | Dictionary mapping tokens to their corpus-wide word frequencies.
scores | ndarray | SIF weights computed as a / (a + freq / total_tokens).
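
The freq output can be pictured as a lookup into a corpus-wide counter. The sketch below stands in for that behaviour with collections.Counter; the module-level wordfreq, the pre-tokenized toy corpus, and the free function computefreq are assumptions made for illustration, not txtai internals.

```python
from collections import Counter

# Toy corpus, already tokenized (hypothetical data)
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "are", "loyal", "companions"],
    ["the", "quick", "brown", "fox", "jumps"],
]

# Corpus-wide word frequencies and total token count
wordfreq = Counter(token for document in corpus for token in document)
total_tokens = sum(wordfreq.values())

def computefreq(tokens):
    """Map each token to its corpus-wide count; unseen tokens count as 0."""
    return {token: wordfreq[token] for token in tokens}

freq = computefreq(["the", "cat", "zebra"])
print(freq, total_tokens)
```

Note that the count for "the" reflects all three occurrences across the corpus rather than its count within any single document.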

Usage Examples

from txtai.scoring import SIF

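# Create SIF scoring with an explicit smoothing parameter (1e-3 is the default)
scoring = SIF({
    "a": 1e-3,
    # Enable the terms index, which search() relies on
    # (an empty dict is falsy and may leave it disabled)
    "terms": True
})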

# Insert documents
documents = [
    (0, "the cat sat on the mat", None),
    (1, "dogs are loyal companions", None),
    (2, "the quick brown fox jumps", None),
]

scoring.insert(documents)
scoring.index()

# Get SIF weights for token weighting in word embeddings
weights = scoring.weights(["cat", "sat", "the", "mat"])
# Common words like "the" get lower weights; rare words like "cat" get higher weights

# Use with search
results = scoring.search("cat companions", limit=5)

# Save the index to disk, then release resources
scoring.save("/tmp/sif_scoring")
scoring.close()
