Implementation:Online ml River Cluster TextClust
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River; River Docs; Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption (Assenmacher and Trautmann, 2022); Stream Clustering of Chat Messages with Applications to Twitch Streams (Carnein, Assenmacher, Trautmann, 2017) | Online Clustering, Text Mining, NLP | 2026-02-08 16:00 GMT |
Overview
Concrete tool for performing online clustering of text data streams using TF-IDF-weighted micro-clusters, with optional automatic radius adaptation and periodic macro-clustering via agglomerative hierarchical clustering.
Description
The cluster.TextClust class implements the TextClust algorithm for streaming text clustering. It maintains a dictionary of micro-clusters, each storing a term-frequency vector and a weight. For each incoming document (represented as a bag-of-words dictionary), the algorithm computes the TF-IDF cosine distance to every existing micro-cluster. If the nearest micro-cluster lies within the radius threshold, the document is merged into it; otherwise the document starts a new micro-cluster.
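The merge-or-create step described above can be sketched in plain Python. This is a minimal illustration and not River's implementation: the helper names (`idf_weights`, `cosine_distance`, `assign`) are hypothetical, the IDF formula `log(1 + n / df)` is an assumption chosen to keep weights positive, and fading and weights are left out entirely.

```python
import math


def idf_weights(docs):
    """IDF per term, treating each term-frequency dict in `docs` as one document."""
    n = len(docs)
    df = {}
    for tf in docs:
        for term in tf:
            df[term] = df.get(term, 0) + 1
    # Assumed smoothing: log(1 + n / df) stays positive for ubiquitous terms.
    return {term: math.log(1 + n / count) for term, count in df.items()}


def cosine_distance(a, b, idf):
    """1 - cosine similarity of two term-frequency dicts after IDF weighting."""
    wa = {t: f * idf.get(t, 0.0) for t, f in a.items()}
    wb = {t: f * idf.get(t, 0.0) for t, f in b.items()}
    dot = sum(wa[t] * wb[t] for t in wa.keys() & wb.keys())
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(
        sum(v * v for v in wb.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def assign(doc, clusters, radius):
    """Merge `doc` into the nearest micro-cluster within `radius`, else open a new one."""
    # The incoming document counts as one extra "document" for the IDF statistics.
    idf = idf_weights(list(clusters.values()) + [doc])
    best_id, best_dist = None, 1.0
    for cid, tf in clusters.items():
        d = cosine_distance(doc, tf, idf)
        if d < best_dist:
            best_id, best_dist = cid, d
    if best_id is not None and best_dist < radius:
        # Merge: add the document's term counts to the cluster's counts.
        for term, count in doc.items():
            clusters[best_id][term] = clusters[best_id].get(term, 0) + count
        return best_id
    new_id = len(clusters)  # valid here because this sketch never deletes clusters
    clusters[new_id] = dict(doc)
    return new_id


clusters = {}
a = assign({"first": 1, "document": 1}, clusters, radius=0.5)
b = assign({"first": 1, "document": 2}, clusters, radius=0.5)
c = assign({"super": 1, "unrelated": 1}, clusters, radius=0.5)
```

In this sketch the second document shares both terms with the first and is merged, while the third shares no terms, sits at distance 1.0, and opens a new micro-cluster.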
The class supports automatic radius adjustment via the auto_r parameter, term-level fading via term_fading, and automatic merging of close micro-clusters during cleanup via auto_merge. Macro-clustering is performed on demand using single-linkage agglomerative hierarchical clustering to produce num_macro output clusters.
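To illustrate the macro-clustering step, here is a minimal sketch of single-linkage agglomerative clustering that stops once num_macro groups remain. The function name `single_linkage` and the generic `distance` callback are assumptions for illustration; River's own reclustering code may be organized differently and operates on micro-clusters rather than plain numbers.

```python
def single_linkage(items, distance, num_macro):
    """Repeatedly merge the two closest groups until `num_macro` groups remain."""
    groups = [[i] for i in range(len(items))]  # each item starts in its own group
    while len(groups) > num_macro:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Single linkage: group distance is the smallest pairwise distance.
                d = min(
                    distance(items[i], items[j])
                    for i in groups[a]
                    for j in groups[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        groups[a].extend(groups.pop(b))  # b > a, so index a stays valid
    return groups


# Toy data: two well-separated groups on a line, indices returned per group.
points = [0, 1, 2, 10, 11]
macro = single_linkage(points, lambda p, q: abs(p - q), num_macro=2)
```

With a TF-IDF cosine distance plugged in as `distance` and micro-clusters as `items`, the same loop would yield num_macro groups of micro-cluster indices.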
TextClust is typically used within a compose.Pipeline with a preceding feature_extraction.BagOfWords transformer that converts raw text strings into bag-of-words dictionaries.
Usage
Import cluster.TextClust when you need to cluster a stream of text documents and track evolving topics. Place it after a bag-of-words feature extractor in a pipeline.
Code Reference
Source Location
river/cluster/textclust.py:L13-L623
Signature
class TextClust(base.Clusterer):
    def __init__(
        self,
        radius=0.3,
        fading_factor=0.0005,
        tgap=100,
        term_fading=True,
        real_time_fading=True,
        micro_distance="tfidf_cosine_distance",
        macro_distance="tfidf_cosine_distance",
        num_macro=3,
        min_weight=0,
        auto_r=False,
        auto_merge=True,
        sigma=1
    )
Import
from river import cluster
Key Parameters
| Parameter | Default | Description |
|---|---|---|
| radius | 0.3 | Distance threshold to merge a new document into an existing micro-cluster. Must be in (0, 1]. |
| fading_factor | 0.0005 | Exponential fading factor for micro-cluster weights. |
| tgap | 100 | Time interval between outlier removal / cleanup operations. |
| term_fading | True | Whether individual term frequencies should also be faded over time. |
| real_time_fading | True | If True, fading is driven by wall-clock time; if False, by the number of observations. |
| micro_distance | "tfidf_cosine_distance" | Distance metric for micro-cluster assignment. |
| macro_distance | "tfidf_cosine_distance" | Distance metric for macro-cluster generation. |
| num_macro | 3 | Number of macro-clusters to produce during reclustering. |
| min_weight | 0 | Minimum weight threshold for micro-clusters to participate in reclustering. |
| auto_r | False | Whether to automatically adapt the distance radius threshold. |
| auto_merge | True | Whether to merge close micro-clusters during cleanup. |
| sigma | 1 | Influences the automatic threshold adaptation technique. |
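The interaction of fading_factor and tgap can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes a base-2 exponential decay of the form `weight * 2 ** (-fading_factor * elapsed)`, and the `cleanup` helper with its `threshold` argument is hypothetical. `elapsed` would be measured in seconds or in observations depending on real_time_fading.

```python
def fade(weight, fading_factor, elapsed):
    """Assumed decay rule: the weight halves every 1 / fading_factor time units."""
    return weight * 2 ** (-fading_factor * elapsed)


def cleanup(weights, last_update, now, fading_factor, threshold):
    """Fade every micro-cluster weight to `now` and drop those at or below `threshold`.

    `threshold` is a hypothetical cut-off used only for this illustration.
    """
    kept = {}
    for cid, w in weights.items():
        w = fade(w, fading_factor, now - last_update[cid])
        if w > threshold:
            kept[cid] = w
    return kept


# Cluster 0 was updated recently, cluster 1 has been idle for a long time.
weights = {0: 1.0, 1: 1.0}
last_update = {0: 9_900, 1: 0}
kept = cleanup(weights, last_update, now=10_000, fading_factor=0.0005, threshold=0.5)
half = fade(1.0, 0.0005, 2000)
```

Under this assumed rule, the default fading_factor of 0.0005 corresponds to a half-life of 2000 time units, so a micro-cluster that receives no documents gradually loses weight and is eventually removed during a cleanup pass.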
Methods
| Method | Signature | Description |
|---|---|---|
| learn_one | learn_one(x: dict, t=None, w=None) -> None | Processes a new bag-of-words document x, merging it into the nearest micro-cluster or creating a new one. |
| predict_one | predict_one(x: dict, w=None, type="micro") -> int | Returns the micro-cluster or macro-cluster assignment for document x. |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| x | dict | A bag-of-words dictionary mapping terms to their counts/weights. |
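For readers unfamiliar with the input shape, the sketch below builds such a dictionary with the standard library only. The `bag_of_words` helper is hypothetical and merely approximates what a bag-of-words transformer produces; the exact tokenization used by feature_extraction.BagOfWords may differ.

```python
import re
from collections import Counter


def bag_of_words(text, stop_words=()):
    """Lowercase, split on word characters, drop stop words, count the rest."""
    stop = {w.lower() for w in stop_words}
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(t for t in tokens if t not in stop))


x = bag_of_words("This document is the second document.", stop_words=["the", "is"])
```

A dictionary of this form, mapping each term to its count, is what learn_one and predict_one expect as x.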
Outputs
| Output | Type | Description |
|---|---|---|
| predict_one return | int | The cluster ID (micro-cluster or macro-cluster, depending on the type parameter) assigned to the document. |
Usage Examples
from river import compose
from river import feature_extraction
from river import metrics
from river import cluster

# Small labelled corpus; "cluster" holds the ground-truth label used for evaluation.
corpus = [
    {"text": 'This is the first document.', "cluster": 1},
    {"text": 'This document is the second document.', "cluster": 1},
    {"text": 'And this is super unrelated.', "cluster": 2},
    {"text": 'Is this the first document?', "cluster": 1},
    {"text": 'This is super unrelated as well', "cluster": 2},
    {"text": 'Test text', "cluster": 5}
]

stopwords = ['stop', 'the', 'to', 'and', 'a', 'in', 'it', 'is', 'I']

metric = metrics.AdjustedRand()

# Raw text -> bag-of-words dictionary (unigrams and bigrams) -> TextClust.
model = compose.Pipeline(
    feature_extraction.BagOfWords(
        lowercase=True,
        ngram_range=(1, 2),
        stop_words=stopwords
    ),
    cluster.TextClust(
        real_time_fading=False,
        fading_factor=0.001,
        tgap=100,
        auto_r=True,
        radius=0.9
    )
)

# Predict first, then learn, so each document is scored before the model has seen it.
for x in corpus:
    y_pred = model.predict_one(x["text"])
    y = x["cluster"]
    metric.update(y, y_pred)
    model.learn_one(x["text"])

print(metric)
# AdjustedRand: -0.17647058823529413