Principle:Online ml River TextClust Clustering
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| River; River Docs; Textual One-Pass Stream Clustering with Automated Distance Threshold Adaption (Assenmacher and Trautmann, 2022); Stream Clustering of Chat Messages with Applications to Twitch Streams (Carnein, Assenmacher, Trautmann, 2017) | Online Clustering, Text Mining, NLP, Streaming Algorithms | 2026-02-08 16:00 GMT |
Overview
TextClust is an online clustering algorithm specialized for text data streams that uses TF-IDF-weighted micro-clusters with automatic radius adjustment and macro-clustering for discovering and tracking evolving topics.
Description
TextClust is designed specifically for clustering textual data streams, such as social media posts, chat messages, or news articles. It represents each micro-cluster as a weighted TF-IDF feature vector, enabling it to capture the semantic content of text clusters.
The algorithm follows a two-phase approach:
Online Phase (Micro-cluster Maintenance): Each incoming text document is represented as a bag-of-words dictionary. The algorithm computes the TF-IDF cosine distance between the new document and all existing micro-clusters. If the nearest micro-cluster is within the radius threshold (or the automatically determined threshold when auto_r=True), the document is merged into that micro-cluster. Otherwise, a new micro-cluster is created. Micro-cluster weights fade over time via the fading_factor, and a periodic cleanup every tgap steps removes weak micro-clusters and optionally merges close ones (when auto_merge=True).
Offline Phase (Macro-clustering): Upon request, the micro-clusters are reclustered using single-linkage agglomerative hierarchical clustering to produce num_macro final clusters. This phase uses the macro-distance metric (by default, TF-IDF cosine distance) to build a distance matrix and greedily merge the closest cluster pairs.
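The merge-or-create decision of the online phase can be sketched in a few lines of Python. This is an illustrative simplification, not River's implementation: it uses a fixed radius, replaces the TF-IDF cosine distance with a plain cosine distance on raw term counts, and leaves out fading and the periodic cleanup. The names `cosine_distance` and `learn_one` and the dictionary fields are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Stand-in distance: plain cosine distance on raw term counts (no IDF)."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def learn_one(micro_clusters, doc, t, radius=0.5):
    """Merge doc into the nearest micro-cluster if it lies within radius,
    otherwise open a new micro-cluster (hypothetical helper)."""
    best, best_dist = None, float("inf")
    for mc in micro_clusters:
        d = cosine_distance(doc, mc["tf"])
        if d < best_dist:
            best, best_dist = mc, d
    if best is not None and best_dist < radius:
        # Merge: add the document's term counts to the cluster's term frequencies.
        for term, count in doc.items():
            best["tf"][term] = best["tf"].get(term, 0) + count
        best["weight"] += 1
        best["last_update"] = t
    else:
        micro_clusters.append({"tf": dict(doc), "weight": 1, "last_update": t})
    return micro_clusters


# Three bag-of-words documents arriving one at a time.
stream = [
    {"goal": 1, "match": 1},
    {"goal": 2, "penalty": 1},
    {"recipe": 1, "pasta": 1},
]
clusters = []
for t, doc in enumerate(stream):
    learn_one(clusters, doc, t)
```

In this toy stream the second document shares the term "goal" with the first and is merged into its micro-cluster, while the third shares no terms with it and opens a new one.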
A distinctive feature is automatic radius adjustment (auto_r): instead of using a fixed distance threshold, the algorithm dynamically computes a threshold based on the mean and standard deviation of distances to existing micro-clusters, allowing it to adapt to varying text densities.
Usage
Use TextClust when:
- Your data is a stream of text documents (tweets, messages, articles, etc.).
- You want to identify and track topics as they evolve over time.
- The text is represented as bag-of-words features (typically via feature_extraction.BagOfWords in a pipeline).
- You need adaptive thresholding for varying text density in the stream.
TextClust is commonly used in a River compose.Pipeline preceded by a feature_extraction.BagOfWords transformer.
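To make the expected input concrete, the following sketch shows the kind of term-count dictionary a bag-of-words step hands to the clusterer. It is a hypothetical stand-in written with the standard library only, not River's feature_extraction.BagOfWords; the tokenization rule (lowercase, split on non-alphanumeric characters) is an assumption.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, extract alphanumeric tokens, and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return dict(Counter(tokens))


doc = bag_of_words("Goal! What a goal by the home team")
```

Each document in the stream becomes one such dictionary, mapping a term to its count in that document.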
Theoretical Basis
Distance Metric -- TF-IDF Cosine Distance:
For two micro-clusters A and B with term frequency vectors and a global IDF dictionary:
cosine_similarity(A, B) = SUM_k (A.tf[k] * idf[k]) * (B.tf[k] * idf[k])
/ (||A_tfidf|| * ||B_tfidf||)
tfidf_cosine_distance(A, B) = 1 - cosine_similarity(A, B)
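Where IDF is computed from the micro-cluster collection, with N_microclusters the number of micro-clusters and df[k] the number of micro-clusters that contain term k: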
idf[k] = 1 + log(N_microclusters / df[k])
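As a worked illustration of the two formulas above, here is a minimal Python sketch that treats each micro-cluster as a plain term-frequency dictionary. The function names are hypothetical, and using the natural logarithm is an assumption, since the formula only says "log".

```python
import math


def idf_weights(micro_clusters):
    """idf[k] = 1 + log(N / df[k]), where df[k] counts the micro-clusters
    containing term k (natural log assumed)."""
    n = len(micro_clusters)
    df = {}
    for tf in micro_clusters:
        for term in tf:
            df[term] = df.get(term, 0) + 1
    return {term: 1.0 + math.log(n / count) for term, count in df.items()}


def tfidf_cosine_distance(tf_a, tf_b, idf):
    """1 - cosine similarity of the TF-IDF weighted term vectors."""
    dot = sum(tf_a[k] * idf[k] * tf_b[k] * idf[k] for k in tf_a if k in tf_b)
    norm_a = math.sqrt(sum((v * idf[k]) ** 2 for k, v in tf_a.items()))
    norm_b = math.sqrt(sum((v * idf[k]) ** 2 for k, v in tf_b.items()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


a = {"goal": 2, "match": 1}
b = {"match": 1, "stream": 3}
c = {"recipe": 1}
idf = idf_weights([a, b, c])

d_same = tfidf_cosine_distance(a, a, idf)      # identical vectors
d_overlap = tfidf_cosine_distance(a, b, idf)   # share the term "match"
d_disjoint = tfidf_cosine_distance(a, c, idf)  # no shared terms
```

Identical vectors give a distance of (numerically) zero, vectors with no shared terms give exactly 1, and partial overlap falls in between. The shared term "match" appears in two of the three micro-clusters, so its IDF is lower than that of the terms unique to one cluster.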
Automatic Radius Adjustment:
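When auto_r=True, the threshold is computed adaptively for each incoming document from its distances to all existing micro-clusters. Here sum_distances and square_sum are the sum of those distances and of their squares, min_distance is the distance to the nearest micro-cluster, and sigma is a multiplier on the standard-deviation term: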
mu = (sum_distances - min_distance) / (num_clusters - 1)
threshold = mu - sigma * sqrt(square_sum / (num_clusters - 1) - mu^2)
IF min_distance < threshold:
    Merge into nearest micro-cluster
ELSE:
    Create new micro-cluster
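The threshold rule can be tried out with the short sketch below. It follows the formulas as written, with some assumptions that the text does not settle: square_sum is taken over all distances including the minimum, a slightly negative variance from rounding is clamped to zero, and when fewer than two micro-clusters exist the sketch falls back to a fixed radius. The function names are hypothetical.

```python
import math


def auto_threshold(distances, sigma=1.0):
    """Adaptive threshold from the distances of one document to all
    existing micro-clusters. Returns None when it is undefined."""
    n = len(distances)
    if n < 2:
        return None  # (num_clusters - 1) would be zero
    min_distance = min(distances)
    mu = (sum(distances) - min_distance) / (n - 1)
    square_sum = sum(d * d for d in distances)  # assumption: includes the minimum
    variance = max(square_sum / (n - 1) - mu * mu, 0.0)
    return mu - sigma * math.sqrt(variance)


def should_merge(distances, sigma=1.0, radius=0.3):
    """True if the document should join its nearest micro-cluster."""
    if not distances:
        return False  # no micro-clusters yet: a new one must be created
    threshold = auto_threshold(distances, sigma)
    if threshold is None:
        threshold = radius  # assumption: fall back to the fixed radius
    return min(distances) < threshold


# One micro-cluster is clearly closer than the rest: merge.
clear_match = should_merge([0.1, 0.9, 0.9, 0.9])
# All micro-clusters are about equally far away: create a new one.
no_match = should_merge([0.85, 0.9, 0.95, 0.9])
```

The effect is that a document is merged only when its nearest micro-cluster stands out against the typical distance to the others, rather than when it falls under a hand-tuned radius.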
Fading Strategy:
Micro-cluster weights decay exponentially:
weight_new = weight * 2^(-fading_factor * (t_now - t_last))
When term_fading=True, individual term frequencies within micro-clusters also fade, and terms whose TF falls below omega = 2^(-fading_factor * tgap) are removed entirely.
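A small numeric sketch of the decay rules follows. It is a simplification under stated assumptions: every term in a micro-cluster is faded with the same elapsed time (the text does not say whether terms carry their own timestamps), and the function names are hypothetical.

```python
def fade_weight(weight, fading_factor, t_now, t_last):
    """weight_new = weight * 2^(-fading_factor * (t_now - t_last))"""
    return weight * 2 ** (-fading_factor * (t_now - t_last))


def fade_terms(tf, fading_factor, t_now, t_last, tgap):
    """Fade every term frequency, then drop terms that fall below omega."""
    omega = 2 ** (-fading_factor * tgap)
    decay = 2 ** (-fading_factor * (t_now - t_last))
    faded = {term: count * decay for term, count in tf.items()}
    return {term: count for term, count in faded.items() if count >= omega}


# With fading_factor = 0.5, the weight halves every two time steps.
w = fade_weight(1.0, fading_factor=0.5, t_now=2, t_last=0)

# After four steps every term is scaled by 0.25; omega is 0.5 for tgap = 2,
# so the rare term is removed while the frequent one survives.
terms = fade_terms({"goal": 4.0, "rare": 1.0},
                   fading_factor=0.5, t_now=4, t_last=0, tgap=2)
```

Because the base is 2, 1 / fading_factor is the half-life of a micro-cluster's weight, measured in the same time units as t_now and t_last.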
Macro-clustering:
Single-linkage agglomerative clustering is applied on the micro-cluster distance matrix until num_macro clusters remain. At each step, the two clusters with the smallest minimum pairwise distance are merged.
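The greedy merging step can be illustrated with a naive single-linkage routine over a precomputed distance matrix. This is a sketch for small inputs, not the library's code; the function name and the list-of-sets return format are hypothetical.

```python
def single_linkage(dist, num_macro):
    """Group micro-cluster indices by single-linkage agglomeration.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of sets of micro-cluster indices.
    """
    groups = [{i} for i in range(len(dist))]
    while len(groups) > max(num_macro, 1):
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                # Single linkage: distance between the closest members.
                d = min(dist[a][b] for a in groups[i] for b in groups[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        groups[i] |= groups[j]
        del groups[j]
    return groups


# Four micro-clusters: 0 and 1 are close, 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
macro = single_linkage(dist, num_macro=2)
```

Here the pair (0, 1) is merged first because it has the smallest distance, then (2, 3), leaving the two requested macro-clusters.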