Implementation:Online ml River Preprocessing LDA
| Knowledge Sources | |
|---|---|
| Domains | Online_Learning, Preprocessing, Topic_Modeling |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Online Latent Dirichlet Allocation (LDA) with infinite vocabulary for streaming topic modeling.
Description
This implementation provides an online variant of Latent Dirichlet Allocation that can handle an infinite vocabulary, meaning the set of tokens does not need to be known in advance. It uses variational inference to incrementally update document-topic and word-topic distributions as new documents arrive. The implementation maintains running statistics for topics and automatically prunes the vocabulary when it exceeds a maximum size. It assumes documents are represented as token counts (bag-of-words format).
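To make the bag-of-words assumption concrete, the following minimal sketch builds the kind of token-count dictionary the transformer expects. The helper `token_counts` is hypothetical and uses plain whitespace splitting; it is not River's tokenizer, which may normalize text differently.

```python
from collections import Counter


def token_counts(text: str) -> dict[str, int]:
    """Hypothetical helper: lowercase, split on whitespace, count each token."""
    return dict(Counter(text.lower().split()))


doc = token_counts("weather cold weather")
print(doc)  # {'weather': 2, 'cold': 1}
```

No vocabulary is declared anywhere: each document simply carries whatever tokens it contains, which is the property an infinite-vocabulary model relies on.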
Usage
Use this when you need to extract latent topics from streaming text data without knowing the complete vocabulary upfront. It is particularly useful when the vocabulary is large or evolves over time. Because the transformer expects token counts, tokenize beforehand, for example with `feature_extraction.BagOfWords`. It is suitable for document categorization, content recommendation, and exploratory text analysis in online settings.
Code Reference
Source Location
- Repository: Online_ml_River
- File: river/preprocessing/lda.py
Signature
class LDA(base.Transformer):
    def __init__(
        self,
        n_components=10,
        number_of_documents=1e6,
        alpha_theta=0.5,
        alpha_beta=100.0,
        tau=64.0,
        kappa=0.75,
        vocab_prune_interval=10,
        number_of_samples=10,
        ranking_smooth_factor=1e-12,
        burn_in_sweeps=5,
        maximum_size_vocabulary=4000,
        seed: int | None = None,
    )
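In online variational inference for LDA, `tau` and `kappa` conventionally define a decaying step size of the form (tau + t)^(-kappa), where t counts the documents seen so far. The sketch below illustrates that schedule with the default values from the signature. That `river/preprocessing/lda.py` uses exactly this expression is an assumption here, and `step_size` is a hypothetical name, not part of River's API.

```python
def step_size(t: int, tau: float = 64.0, kappa: float = 0.75) -> float:
    """Conventional online-LDA step size: (tau + t) ** -kappa.

    Assumption: this mirrors the usual schedule from the online LDA
    literature; it is not copied from River's source.
    """
    return (tau + t) ** (-kappa)


# A larger tau damps the earliest updates; kappa controls how fast
# the influence of new documents decays as t grows.
for t in (1, 100, 10_000):
    print(t, step_size(t))
```

With this form, the weight given to each new document shrinks monotonically, so early documents cannot dominate and later ones refine rather than overwrite the topic statistics.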
Import
from river import preprocessing
I/O Contract
| Input | Output |
|---|---|
| Dict[str, int] - Token counts (bag-of-words) | Dict[int, float] - Per-topic weights keyed by topic index (not normalized; values need not sum to 1) |
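The values returned for a document, such as `{0: 0.5, 1: 2.5}` in the example on this page, do not sum to 1. If proportions are needed, they can be rescaled after the fact. The helper below is a minimal sketch; `normalize_topics` is a hypothetical name and not part of River.

```python
def normalize_topics(weights: dict[int, float]) -> dict[int, float]:
    """Rescale per-topic weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("topic weights must have a positive sum")
    return {topic: value / total for topic, value in weights.items()}


proportions = normalize_topics({0: 0.5, 1: 2.5})
print(proportions)
```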
Usage Examples
from river import compose
from river import feature_extraction
from river import preprocessing

X = [
    'weather cold',
    'weather hot dry',
    'weather cold rainy',
    'weather hot',
    'weather cold humid',
]

lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2,
        number_of_documents=60,
        seed=42
    )
)

for x in X:
    lda.learn_one(x)
    topics = lda.transform_one(x)
    print(topics)

# {0: 0.5, 1: 2.5}
# {0: 2.499..., 1: 1.5}
# {0: 0.5, 1: 3.5}
# {0: 0.5, 1: 2.5}
# {0: 1.5, 1: 2.5}
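For document categorization, one simple option is to label each document with its highest-weight topic. The sketch below applies that rule to outputs copied from the example above, leaving out the second one because its printed value is truncated. `dominant_topic` is a hypothetical helper, not a River function.

```python
def dominant_topic(weights: dict[int, float]) -> int:
    """Return the topic index with the largest weight."""
    return max(weights, key=weights.get)


# Outputs copied from the example above (second document omitted).
outputs = [
    {0: 0.5, 1: 2.5},
    {0: 0.5, 1: 3.5},
    {0: 0.5, 1: 2.5},
    {0: 1.5, 1: 2.5},
]

labels = [dominant_topic(w) for w in outputs]
print(labels)  # [1, 1, 1, 1]
```

Topic indices carry no inherent meaning, so any such label is only useful relative to other documents processed by the same model instance.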