

Implementation:Online ml River Cluster TextClust



Knowledge Sources: River; River Docs; Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption (Assenmacher and Trautmann, 2022); Stream Clustering of Chat Messages with Applications to Twitch Streams (Carnein, Assenmacher, Trautmann, 2017)
Domains: Online Clustering, Text Mining, NLP
Last Updated: 2026-02-08 16:00 GMT

Overview

Concrete tool for performing online clustering of text data streams using TF-IDF-weighted micro-clusters, with optional automatic radius adaptation and periodic macro-clustering via agglomerative hierarchical clustering.

Description

The cluster.TextClust class implements the TextClust algorithm for streaming text clustering. It maintains a dictionary of micro-clusters, each storing TF-IDF term frequency vectors and weights. For each incoming document (represented as a bag-of-words dictionary), the algorithm computes TF-IDF cosine distances to all existing micro-clusters and either merges the document into the nearest one or creates a new micro-cluster.
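The assignment step described above can be sketched in plain Python. This is a simplified illustration, not River's implementation: raw term counts stand in for TF-IDF weights, and the helper names `cosine_distance` and `assign` are hypothetical.

```python
import math


def cosine_distance(a: dict, b: dict) -> float:
    """Return 1 - cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def assign(doc: dict, micro_clusters: dict, radius: float) -> int:
    """Merge doc into the nearest micro-cluster if it lies within radius,
    otherwise open a new micro-cluster. Returns the micro-cluster id."""
    best_id, best_dist = None, float("inf")
    for cid, terms in micro_clusters.items():
        d = cosine_distance(doc, terms)
        if d < best_dist:
            best_id, best_dist = cid, d

    if best_id is not None and best_dist < radius:
        # Merge: add the document's term counts to the cluster's counts.
        target = micro_clusters[best_id]
        for term, count in doc.items():
            target[term] = target.get(term, 0) + count
        return best_id

    # No cluster is close enough: the document seeds a new micro-cluster.
    new_id = len(micro_clusters)
    micro_clusters[new_id] = dict(doc)
    return new_id


clusters = {}
a = assign({"first": 1, "document": 1}, clusters, radius=0.5)
b = assign({"second": 1, "document": 2}, clusters, radius=0.5)
c = assign({"super": 1, "unrelated": 1}, clusters, radius=0.5)
```

Here the second document shares the term "document" with the first and is merged into the same micro-cluster, while the third shares no terms with it and starts a new one.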

The class supports automatic radius adjustment via the auto_r parameter, term-level fading via term_fading, and automatic merging of close micro-clusters during cleanup via auto_merge. Macro-clustering is performed on demand using single-linkage agglomerative hierarchical clustering to produce num_macro output clusters.
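To make the macro-clustering step concrete, the following is a minimal sketch of single-linkage agglomerative clustering over micro-cluster term vectors. It is an illustration under stated assumptions rather than River's code: the function names are hypothetical, plain term counts replace TF-IDF weights, and the naive search is only suitable for a handful of micro-clusters.

```python
import math
from itertools import combinations


def cosine_distance(a: dict, b: dict) -> float:
    """Return 1 - cosine similarity of two sparse term-weight vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def single_linkage(micro: dict, num_macro: int) -> dict:
    """Map each micro-cluster id to a macro-cluster label.

    Starts with one group per micro-cluster and repeatedly merges the two
    groups whose closest members are nearest, until num_macro groups remain.
    """
    groups = [{cid} for cid in micro]
    while len(groups) > num_macro:
        best = None
        for i, j in combinations(range(len(groups)), 2):
            # Single linkage: group distance is the minimum pairwise distance.
            d = min(
                cosine_distance(micro[p], micro[q])
                for p in groups[i]
                for q in groups[j]
            )
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        groups[i] |= groups[j]
        del groups[j]  # safe because i < j
    return {cid: label for label, group in enumerate(groups) for cid in group}


micro = {
    0: {"first": 2, "document": 3},
    1: {"document": 1, "second": 1},
    2: {"super": 1, "unrelated": 2},
    3: {"unrelated": 1, "well": 1},
}
labels = single_linkage(micro, num_macro=2)
```

With these four toy micro-clusters, the two that mention "document" end up in one macro-cluster and the two that mention "unrelated" in the other.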

TextClust is typically used within a compose.Pipeline with a preceding feature_extraction.BagOfWords transformer that converts raw text strings into bag-of-words dictionaries.

Usage

Import cluster.TextClust when you need to cluster a stream of text documents and track evolving topics. Place it after a bag-of-words feature extractor in a pipeline.

Code Reference

Source Location

river/cluster/textclust.py:L13-L623

Signature

class TextClust(base.Clusterer):
    def __init__(
        self,
        radius=0.3,
        fading_factor=0.0005,
        tgap=100,
        term_fading=True,
        real_time_fading=True,
        micro_distance="tfidf_cosine_distance",
        macro_distance="tfidf_cosine_distance",
        num_macro=3,
        min_weight=0,
        auto_r=False,
        auto_merge=True,
        sigma=1
    )

Import

from river import cluster

Key Parameters

Parameter | Default | Description
radius | 0.3 | Distance threshold to merge a new document into an existing micro-cluster. Must be in (0, 1].
fading_factor | 0.0005 | Exponential fading factor for micro-cluster weights.
tgap | 100 | Time interval between outlier removal / cleanup operations.
term_fading | True | Whether individual term frequencies should also be faded over time.
real_time_fading | True | If True, fading is based on wall-clock time; if False, on the number of observations processed.
micro_distance | "tfidf_cosine_distance" | Distance metric for micro-cluster assignment.
macro_distance | "tfidf_cosine_distance" | Distance metric for macro-cluster generation.
num_macro | 3 | Number of macro-clusters to produce during reclustering.
min_weight | 0 | Minimum weight threshold for micro-clusters to participate in reclustering.
auto_r | False | Whether to automatically adapt the distance radius threshold.
auto_merge | True | Whether to merge close micro-clusters during cleanup.
sigma | 1 | Influences the automatic threshold adaptation technique.
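To give a feel for the scale of fading_factor, the sketch below assumes the base-2 decay form that is common in stream clustering, weight * 2^(-fading_factor * elapsed). The page itself only states that fading is exponential, so the base and the helper name `fade` are assumptions.

```python
def fade(weight: float, fading_factor: float, elapsed: float) -> float:
    """Decay a weight exponentially (assumed base-2 form)."""
    return weight * 2 ** (-fading_factor * elapsed)


fading_factor = 0.0005

# Under the base-2 assumption, a weight halves every 1 / fading_factor time units.
half_life = 1 / fading_factor

# Weight of a micro-cluster that has not been updated for one half-life.
faded = fade(1.0, fading_factor, half_life)
```

Under this assumption the default of 0.0005 gives a half-life of roughly 2000 time units, where a time unit is either a unit of wall-clock time or one observation, depending on real_time_fading.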

Methods

Method | Signature | Description
learn_one | learn_one(x: dict, t=None, w=None) -> None | Processes a new bag-of-words document x, merging into the nearest micro-cluster or creating a new one.
predict_one | predict_one(x: dict, w=None, type="micro") -> int | Returns the micro-cluster or macro-cluster assignment for document x.

I/O Contract

Inputs

Parameter | Type | Description
x | dict | A bag-of-words dictionary mapping terms to their counts/weights.
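For illustration, the shape of such an input dictionary can be produced with the standard library alone. The regex tokenizer and the `bag_of_words` helper below are assumptions made for this sketch; in practice feature_extraction.BagOfWords performs this step and its tokenization may differ.

```python
import re
from collections import Counter


def bag_of_words(text: str, stop_words=frozenset()) -> dict:
    """Lowercase the text, split it into word tokens, drop stop words,
    and count the occurrences of each remaining term."""
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(t for t in tokens if t not in stop_words))


x = bag_of_words("This document is the second document.", {"the", "is"})
# x == {'this': 1, 'document': 2, 'second': 1}
```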

Outputs

Output | Type | Description
predict_one return | int | The cluster ID (micro-cluster or macro-cluster depending on type parameter) assigned to the document.

Usage Examples

from river import compose
from river import feature_extraction
from river import metrics
from river import cluster

corpus = [
    {"text": 'This is the first document.', "cluster": 1},
    {"text": 'This document is the second document.', "cluster": 1},
    {"text": 'And this is super unrelated.', "cluster": 2},
    {"text": 'Is this the first document?', "cluster": 1},
    {"text": 'This is super unrelated as well', "cluster": 2},
    {"text": 'Test text', "cluster": 5}
]

stopwords = ['stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I']

metric = metrics.AdjustedRand()

model = compose.Pipeline(
    feature_extraction.BagOfWords(
        lowercase=True,
        ngram_range=(1, 2),
        stop_words=stopwords
    ),
    cluster.TextClust(
        real_time_fading=False,
        fading_factor=0.001,
        tgap=100,
        auto_r=True,
        radius=0.9
    )
)

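# Test-then-train loop: predict each document before the model learns from it.
for x in corpus:
    y_pred = model.predict_one(x["text"])  # cluster assignment before the update
    y = x["cluster"]                       # reference label for this document
    metric.update(y, y_pred)
    model.learn_one(x["text"])             # update the micro-clusters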

print(metric)
# AdjustedRand: -0.17647058823529413
