Implementation:Online ml River Cluster TextClust

Knowledge Sources	Domains	Last Updated
River; River Docs; Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption (Assenmacher and Trautmann, 2022); Stream Clustering of Chat Messages with Applications to Twitch Streams (Carnein, Assenmacher, Trautmann, 2017)	Online Clustering, Text Mining, NLP	2026-02-08 16:00 GMT

Overview

Concrete tool for performing online clustering of text data streams using TF-IDF-weighted micro-clusters, with optional automatic radius adaptation and periodic macro-clustering via agglomerative hierarchical clustering.

Description

The cluster.TextClust class implements the TextClust algorithm for streaming text clustering. It maintains a dictionary of micro-clusters, each storing a TF-IDF term-frequency vector and a weight. For each incoming document (represented as a bag-of-words dictionary), the algorithm computes the TF-IDF cosine distance to every existing micro-cluster, then either merges the document into the nearest one if that distance is within radius or creates a new micro-cluster.
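The assignment step described above can be sketched in plain Python. This is a simplified illustration and not River's implementation: the helper names (smoothed_idf, tfidf_cosine_distance, assign), the smoothed IDF formula, the neutral weight for unseen terms, and the strict `<` comparison against radius are all assumptions made for the example.

```python
import math
from collections import Counter


def smoothed_idf(clusters):
    """IDF per term over micro-clusters: 1 + log(N / df). The smoothing is an assumption."""
    n = len(clusters)
    df = Counter(term for tf in clusters.values() for term in tf)
    return {term: 1.0 + math.log(n / count) for term, count in df.items()}


def tfidf_cosine_distance(tf_a, tf_b, idf):
    """Return 1 - cosine similarity of two TF-IDF-weighted term vectors."""
    def weighted(tf):
        # Terms not seen in any cluster get a neutral weight of 1.0 (assumption).
        return {t: f * idf.get(t, 1.0) for t, f in tf.items()}

    a, b = weighted(tf_a), weighted(tf_b)
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def assign(doc, clusters, radius=0.3):
    """Merge doc into the nearest micro-cluster within radius, else create one."""
    idf = smoothed_idf(clusters)
    distances = {cid: tfidf_cosine_distance(doc, tf, idf) for cid, tf in clusters.items()}
    nearest = min(distances, key=distances.get, default=None)
    if nearest is not None and distances[nearest] < radius:
        clusters[nearest].update(doc)  # add the document's term counts
        return nearest
    new_id = len(clusters)
    clusters[new_id] = Counter(doc)  # open a new micro-cluster
    return new_id


# Toy documents (made up for illustration), using a looser radius of 0.5.
clusters = {}
first = assign({"first": 1, "document": 1}, clusters, radius=0.5)
second = assign({"document": 2, "second": 1}, clusters, radius=0.5)
third = assign({"super": 1, "unrelated": 1}, clusters, radius=0.5)
```

In this toy run the second document shares a term with the first and is merged into it, while the third shares none and opens a new micro-cluster.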

The class supports automatic radius adjustment via the auto_r parameter, term-level fading via term_fading, and automatic merging of close micro-clusters during cleanup via auto_merge. Macro-clustering is performed on demand using single-linkage agglomerative hierarchical clustering to produce num_macro output clusters.
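As a rough illustration of the macro-clustering step, the following sketch merges items by single linkage until num_macro groups remain. It is a generic textbook version written for this page, not the code River runs; the function name single_linkage and the toy one-dimensional distances are assumptions.

```python
from itertools import combinations


def single_linkage(ids, distance, num_macro):
    """Merge the two closest groups until only num_macro groups remain."""
    groups = [{i} for i in ids]
    while len(groups) > max(num_macro, 1):
        # Single linkage: group distance is the smallest pairwise member distance.
        a, b = min(
            combinations(range(len(groups)), 2),
            key=lambda p: min(distance(i, j) for i in groups[p[0]] for j in groups[p[1]]),
        )
        groups[a] |= groups[b]  # a < b, so deleting b keeps index a valid
        del groups[b]
    return {i: label for label, group in enumerate(groups) for i in group}


# Toy micro-cluster positions on a line (made-up data), reduced to two macro-clusters.
positions = {0: 0.0, 1: 0.1, 2: 0.9, 3: 1.0}
labels = single_linkage(positions, lambda i, j: abs(positions[i] - positions[j]), num_macro=2)
```

In TextClust the items would be micro-clusters and the distance would be the configured macro_distance rather than a difference of positions.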

TextClust is typically used within a compose.Pipeline with a preceding feature_extraction.BagOfWords transformer that converts raw text strings into bag-of-words dictionaries.

Usage

Import cluster.TextClust when you need to cluster a stream of text documents and track evolving topics. Place it after a bag-of-words feature extractor in a pipeline.

Code Reference

Source Location

river/cluster/textclust.py:L13-L623

Signature

class TextClust(base.Clusterer):
    def __init__(
        self,
        radius=0.3,
        fading_factor=0.0005,
        tgap=100,
        term_fading=True,
        real_time_fading=True,
        micro_distance="tfidf_cosine_distance",
        macro_distance="tfidf_cosine_distance",
        num_macro=3,
        min_weight=0,
        auto_r=False,
        auto_merge=True,
        sigma=1
    )

Import

from river import cluster

Key Parameters

Parameter	Default	Description
radius	0.3	Distance threshold to merge a new document into an existing micro-cluster. Must be in (0, 1].
fading_factor	0.0005	Exponential fading factor for micro-cluster weights.
tgap	100	Time interval between outlier removal / cleanup operations.
term_fading	True	Whether individual term frequencies should also be faded over time.
real_time_fading	True	If True, fading is based on wall-clock time; if False, it is based on the number of observations.
micro_distance	"tfidf_cosine_distance"	Distance metric for micro-cluster assignment.
macro_distance	"tfidf_cosine_distance"	Distance metric for macro-cluster generation.
num_macro	3	Number of macro-clusters to produce during reclustering.
min_weight	0	Minimum weight threshold for micro-clusters to participate in reclustering.
auto_r	False	Whether to automatically adapt the distance radius threshold.
auto_merge	True	Whether to merge close micro-clusters during cleanup.
sigma	1	Influences the automatic threshold adaptation technique.
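To make the interplay of fading_factor and tgap more concrete, here is a minimal sketch of exponential fading followed by a cleanup pass. The base-2 decay, the threshold parameter, and the function names fade and cleanup are assumptions for illustration; the source only states that weights fade exponentially and that cleanup runs every tgap time units.

```python
def fade(weight, fading_factor, elapsed):
    """Exponentially decay a weight; base 2 is an assumption of this sketch."""
    return weight * 2 ** (-fading_factor * elapsed)


def cleanup(weights, last_update, now, fading_factor, threshold):
    """Fade every micro-cluster to `now` and keep those at or above `threshold`."""
    faded = {
        cid: fade(w, fading_factor, now - last_update[cid])
        for cid, w in weights.items()
    }
    return {cid: w for cid, w in faded.items() if w >= threshold}


# Two toy micro-clusters: one last updated at t=0, one at t=1900.
weights = {0: 1.0, 1: 1.0}
last_update = {0: 0, 1: 1900}
survivors = cleanup(weights, last_update, now=2000, fading_factor=0.0005, threshold=0.6)
```

Under this base-2 assumption, a fading_factor of 0.0005 halves a weight after 2000 time units, where a time unit is one second or one observation depending on real_time_fading.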

Methods

Method	Signature	Description
learn_one	`learn_one(x: dict, t=None, w=None) -> None`	Processes a new bag-of-words document x, merging into the nearest micro-cluster or creating a new one.
predict_one	`predict_one(x: dict, w=None, type="micro") -> int`	Returns the cluster assignment for document x: a micro-cluster ID when `type="micro"`, a macro-cluster ID when `type="macro"`.

I/O Contract

Inputs

Parameter	Type	Description
x	`dict`	A bag-of-words dictionary mapping terms to their counts/weights.
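For orientation, the expected input shape can be produced with a few lines of standard-library Python. This is a stand-in written for illustration, not the feature_extraction.BagOfWords implementation; the function name bag_of_words and the simple regex tokenizer are assumptions.

```python
import re
from collections import Counter


def bag_of_words(text, stop_words=()):
    """Lowercase, split into word tokens, drop stop words, and count terms."""
    tokens = re.findall(r"\w+", text.lower())
    stop = {word.lower() for word in stop_words}
    return dict(Counter(token for token in tokens if token not in stop))


x = bag_of_words("This document is the second document.", stop_words=["the", "is"])
# x == {"this": 1, "document": 2, "second": 1}
```

A dictionary of this shape is what learn_one and predict_one receive as x.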

Outputs

Output	Type	Description
predict_one return	`int`	The cluster ID (micro-cluster or macro-cluster depending on `type` parameter) assigned to the document.

Usage Examples

from river import compose
from river import feature_extraction
from river import metrics
from river import cluster

corpus = [
    {"text": 'This is the first document.', "cluster": 1},
    {"text": 'This document is the second document.', "cluster": 1},
    {"text": 'And this is super unrelated.', "cluster": 2},
    {"text": 'Is this the first document?', "cluster": 1},
    {"text": 'This is super unrelated as well', "cluster": 2},
    {"text": 'Test text', "cluster": 5}
]

stopwords = ['stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I']

metric = metrics.AdjustedRand()

model = compose.Pipeline(
    feature_extraction.BagOfWords(
        lowercase=True,
        ngram_range=(1, 2),
        stop_words=stopwords
    ),
    cluster.TextClust(
        real_time_fading=False,
        fading_factor=0.001,
        tgap=100,
        auto_r=True,
        radius=0.9
    )
)

# Prequential evaluation: predict each document's cluster first, then learn from it
for x in corpus:
    y_pred = model.predict_one(x["text"])
    y = x["cluster"]
    metric.update(y, y_pred)
    model.learn_one(x["text"])

print(metric)
# AdjustedRand: -0.17647058823529413

Related Pages

Principle:Online_ml_River_TextClust_Clustering
