Implementation:Online ml River Preprocessing LDA

From Leeroopedia


Knowledge Sources
Domains: Online_Learning, Preprocessing, Topic_Modeling
Last Updated: 2026-02-08 16:00 GMT

Overview

Online Latent Dirichlet Allocation (LDA) with infinite vocabulary for streaming topic modeling.

Description

This implementation provides an online variant of Latent Dirichlet Allocation that can handle an infinite vocabulary, meaning the set of tokens does not need to be known in advance. It uses variational inference to incrementally update document-topic and word-topic distributions as new documents arrive. It maintains running statistics for the topics and automatically prunes the vocabulary once it exceeds a maximum size (see `maximum_size_vocabulary` in the signature). Documents must be supplied as token counts (bag-of-words format).
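The idea of bounding a growing vocabulary can be pictured with a small, self-contained sketch. It is an illustration only: the helper name `prune_vocabulary` and the keep-the-most-frequent rule are assumptions made for this example, not the criterion River's LDA applies internally.

```python
from collections import Counter


def prune_vocabulary(counts: Counter, max_size: int) -> Counter:
    """Keep only the `max_size` most frequent tokens (hypothetical rule)."""
    if len(counts) <= max_size:
        return counts
    return Counter(dict(counts.most_common(max_size)))


# Tokens arrive document by document; the vocabulary is never fixed upfront.
vocab = Counter()
for doc in ["weather cold", "weather hot dry", "weather cold rainy"]:
    vocab.update(doc.split())
    # Cap the stored vocabulary after every document.
    vocab = prune_vocabulary(vocab, max_size=3)
```

Whatever ranking is used, the point is the same: memory stays bounded even though new tokens keep appearing in the stream.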

Usage

Use this transformer when you need to extract latent topics from streaming text without knowing the complete vocabulary upfront. It is particularly useful when the vocabulary is large or evolves over time. It works best downstream of a tokenizing step such as feature_extraction.BagOfWords, which produces the token counts it expects. Typical applications include document categorization, content recommendation, and exploratory text analysis in online settings.

Code Reference

Source Location

Signature

class LDA(base.Transformer):
    def __init__(
        self,
        n_components=10,
        number_of_documents=1e6,
        alpha_theta=0.5,
        alpha_beta=100.0,
        tau=64.0,
        kappa=0.75,
        vocab_prune_interval=10,
        number_of_samples=10,
        ranking_smooth_factor=1e-12,
        burn_in_sweeps=5,
        maximum_size_vocabulary=4000,
        seed: int | None = None,
    )
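In the online LDA literature, `tau` and `kappa` typically define a decaying step size of the form (tau + t)^(-kappa), where t counts the updates seen so far. The sketch below assumes that convention. `step_size` is a hypothetical helper, not part of River's API, and the exact update used internally may differ.

```python
def step_size(t: int, tau: float = 64.0, kappa: float = 0.75) -> float:
    """Decaying step size (tau + t) ** -kappa, under the assumed convention."""
    return (tau + t) ** -kappa


# Step sizes after 0, 10, 100 and 1000 updates, using the default values
# of tau and kappa from the signature.
rates = [step_size(t) for t in (0, 10, 100, 1000)]
```

Under this reading, a larger `tau` dampens the earliest updates, while a `kappa` closer to 1 makes the influence of new documents fade faster.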

Import

from river import preprocessing

I/O Contract

Input: Dict[str, int] - Token counts
Output: Dict[int, float] - Topic distributions
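To make the input side of the contract concrete, here is a minimal sketch of turning raw text into a `Dict[str, int]` of token counts. The whitespace tokenizer and the helper name `to_token_counts` are assumptions for illustration. In a River pipeline this role is played by `feature_extraction.BagOfWords`, whose preprocessing may differ.

```python
from collections import Counter


def to_token_counts(text: str) -> dict[str, int]:
    """Lowercase, split on whitespace, and count tokens (simplified stand-in)."""
    return dict(Counter(text.lower().split()))


counts = to_token_counts("weather cold cold")
# counts == {'weather': 1, 'cold': 2}
```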

Usage Examples

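from river import compose
from river import feature_extraction
from river import preprocessing

# A tiny stream of raw text documents.
X = [
    'weather cold',
    'weather hot dry',
    'weather cold rainy',
    'weather hot',
    'weather cold humid',
]

# BagOfWords turns each string into token counts, which LDA consumes.
lda = compose.Pipeline(
    feature_extraction.BagOfWords(),
    preprocessing.LDA(
        n_components=2,          # number of topics
        number_of_documents=60,  # estimated number of documents
        seed=42                  # fixed seed for reproducible output
    )
)

# Update the model on each document, then read back its topic weights.
for x in X:
    lda.learn_one(x)
    topics = lda.transform_one(x)
    print(topics)
# Expected output:
# {0: 0.5, 1: 2.5}
# {0: 2.499..., 1: 1.5}
# {0: 0.5, 1: 3.5}
# {0: 0.5, 1: 2.5}
# {0: 1.5, 1: 2.5}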
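The values printed above do not sum to 1, so they read as unnormalized per-topic weights. If proportions are needed, one option is to normalize them after the fact. The helper `normalize_topics` below is a hypothetical post-processing step, not something River provides, and the input dictionary is copied from the first output line above.

```python
def normalize_topics(topics: dict[int, float]) -> dict[int, float]:
    """Rescale topic weights so they sum to 1 (left unchanged if all zero)."""
    total = sum(topics.values())
    if total == 0:
        return dict(topics)
    return {topic: weight / total for topic, weight in topics.items()}


topics = {0: 0.5, 1: 2.5}  # first output line of the example above
proportions = normalize_topics(topics)

# The dominant topic is the one with the largest share.
dominant = max(proportions, key=proportions.get)
```

For this document, topic 1 holds five sixths of the total weight, so it would be reported as the dominant topic.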
