Implementation:Online ml River Preprocessing LDA

Knowledge Sources	Online_ml_River
Domains	Online_Learning, Preprocessing, Topic_Modeling
Last Updated	2026-02-08 16:00 GMT

Overview

Online Latent Dirichlet Allocation (LDA) with infinite vocabulary for streaming topic modeling.

Description

This implementation provides an online variant of Latent Dirichlet Allocation (LDA) that supports an open-ended ("infinite") vocabulary: the set of tokens does not need to be known in advance. It uses variational inference to incrementally update the document-topic and word-topic distributions as new documents arrive. It maintains running statistics for each topic and automatically prunes the vocabulary when it exceeds a maximum size (see maximum_size_vocabulary in the signature). Documents are expected as token counts (bag-of-words format).
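The idea of a vocabulary that grows with the stream and is cut back at a size limit can be sketched in a few lines. This is only an illustration: GrowingVocabulary is a hypothetical helper, and pruning by raw frequency is an assumption made for simplicity, not the ranking criterion used in river/preprocessing/lda.py.

```python
from collections import Counter


class GrowingVocabulary:
    """Hypothetical helper: assigns integer ids to tokens as they first appear."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.counts = Counter()  # running frequency of every retained token
        self.index = {}          # token -> integer id

    def update(self, doc_counts):
        # Register unseen tokens and accumulate their counts.
        for token, n in doc_counts.items():
            self.counts[token] += n
            self.index.setdefault(token, len(self.index))
        # Prune only once the size limit is exceeded.
        if len(self.index) > self.max_size:
            self._prune()

    def _prune(self):
        # Assumption: keep the most frequent tokens (not River's criterion),
        # then re-assign contiguous ids so any per-token arrays stay dense.
        kept = [tok for tok, _ in self.counts.most_common(self.max_size)]
        self.counts = Counter({tok: self.counts[tok] for tok in kept})
        self.index = {tok: i for i, tok in enumerate(kept)}


vocab = GrowingVocabulary(max_size=3)
vocab.update({"weather": 1, "cold": 1})
vocab.update({"weather": 1, "hot": 1, "dry": 1})
print(len(vocab.index), "weather" in vocab.index)
```

The point of the sketch is that ids are assigned lazily and may be re-assigned after pruning, so nothing downstream should treat a token id as permanent.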

Usage

Use this transformer to extract latent topics from streaming text when the complete vocabulary is not known upfront. It is particularly useful when the vocabulary is large or evolves over time. It works best after a preprocessing step such as BagOfWords, which turns raw text into token counts. Typical applications include document categorization, content recommendation, and exploratory text analysis in online settings.

Code Reference

Source Location

Repository: Online_ml_River
File: river/preprocessing/lda.py

Signature

class LDA(base.Transformer):
    def __init__(
        self,
        n_components=10,
        number_of_documents=1e6,
        alpha_theta=0.5,
        alpha_beta=100.0,
        tau=64.0,
        kappa=0.75,
        vocab_prune_interval=10,
        number_of_samples=10,
        ranking_smooth_factor=1e-12,
        burn_in_sweeps=5,
        maximum_size_vocabulary=4000,
        seed: int | None = None,
    )
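The tau and kappa defaults are easier to read with the step-size schedule they usually control. The sketch below assumes the standard schedule from online variational Bayes for LDA, rho_t = (tau + t)^(-kappa). Whether lda.py computes exactly this expression is an assumption, and step_size is a hypothetical helper, not part of River.

```python
def step_size(t, tau=64.0, kappa=0.75):
    """Assumed schedule: rho_t = (tau + t) ** -kappa for the t-th update."""
    # A larger tau damps the earliest updates; a larger kappa makes the
    # influence of new documents decay faster.
    return (tau + t) ** -kappa


for t in (0, 10, 100, 1000):
    print(t, step_size(t))
```

Under this schedule, the step size shrinks as more documents are seen, so later documents move the topic estimates less than early ones.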

Import

from river import preprocessing

I/O Contract

Input	Output
Dict[str, int] - Token counts	Dict[int, float] - Topic weights keyed by topic index (not normalized in the usage example below)
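To make the input side of the contract concrete, the following sketch builds a token-count dictionary by hand. token_counts is a hypothetical helper that only lowercases and splits on whitespace; in a pipeline, BagOfWords would normally do this step.

```python
from collections import Counter


def token_counts(text):
    # Hypothetical helper: lowercase, split on whitespace, count tokens.
    return dict(Counter(text.lower().split()))


doc = token_counts("weather cold cold")
print(doc)  # {'weather': 1, 'cold': 2}
```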

Usage Examples

from river import compose
from river import feature_extraction
from river import preprocessing

X = [
    'weather cold',
    'weather hot dry',
    'weather cold rainy',
    'weather hot',
    'weather cold humid',
]

lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2,
        number_of_documents=60,
        seed=42
    )
)

for x in X:
    lda.learn_one(x)               # update the model with this document
    topics = lda.transform_one(x)  # topic weights for the same document
    print(topics)
# Output:
# {0: 0.5, 1: 2.5}
# {0: 2.499..., 1: 1.5}
# {0: 0.5, 1: 3.5}
# {0: 0.5, 1: 2.5}
# {0: 1.5, 1: 2.5}
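
The values printed above do not sum to 1, so they read more naturally as per-topic weights than as probabilities. If proportions are needed downstream, they can be normalized. The helper below is a hypothetical post-processing step, not part of River.

```python
def normalize(topics):
    """Scale topic weights so they sum to 1 (hypothetical helper)."""
    total = sum(topics.values())
    if total == 0:
        # No mass on any topic: return zeros rather than dividing by zero.
        return {k: 0.0 for k in topics}
    return {k: v / total for k, v in topics.items()}


# First output of the usage example above.
proportions = normalize({0: 0.5, 1: 2.5})
print(proportions)
```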

Related Pages

Environment:Online_ml_River_Python_Runtime_Environment
