Implementation:Scikit learn Scikit learn LatentDirichletAllocation
| Knowledge Sources | |
|---|---|
| Domains | Topic Modeling, Natural Language Processing |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
Concrete tool in scikit-learn for topic modeling with Latent Dirichlet Allocation, fitted by online variational Bayes.
Description
LatentDirichletAllocation implements the online variational Bayes algorithm for Latent Dirichlet Allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. It discovers abstract topics in a set of documents by learning a topic-word distribution for each topic and a document-topic distribution for each document. The implementation supports both batch and online learning methods: batch updates the topic-word parameters from all documents on every pass, while online updates them from mini-batches and is generally much faster on large datasets.
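As a rough sketch of the two learning methods described above, the snippet below fits the same count matrix in batch mode, in online mode, and incrementally with partial_fit. The synthetic Poisson data, the four topics, and the chunk size of 50 are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
# Made-up count matrix: 200 "documents" over a 30-term vocabulary.
X = rng.poisson(1.0, size=(200, 30))

# Batch: each pass updates the topic-word parameters from all documents.
lda_batch = LatentDirichletAllocation(
    n_components=4, learning_method="batch", max_iter=5, random_state=0
).fit(X)

# Online: each pass is split into mini-batches of batch_size documents.
lda_online = LatentDirichletAllocation(
    n_components=4, learning_method="online", batch_size=50,
    max_iter=5, random_state=0
).fit(X)

# Incremental: feed chunks one at a time; total_samples tells the model
# how many documents to expect overall.
lda_stream = LatentDirichletAllocation(
    n_components=4, total_samples=200, random_state=0
)
for start in range(0, X.shape[0], 50):
    lda_stream.partial_fit(X[start:start + 50])

for name, model in [("batch", lda_batch), ("online", lda_online),
                    ("stream", lda_stream)]:
    print(name, model.components_.shape)
```

The partial_fit variant is the one to reach for when the corpus does not fit in memory; the other two differ only in how each pass over an in-memory matrix is consumed.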
Usage
Use LatentDirichletAllocation when you need to discover hidden topics in a collection of documents or any discrete count data. It is commonly applied to text mining, document clustering, and content recommendation systems where understanding the thematic structure of a corpus is important.
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/decomposition/_lda.py
Signature
class LatentDirichletAllocation(
    ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator
):
    def __init__(
        self,
        n_components=10,
        *,
        doc_topic_prior=None,
        topic_word_prior=None,
        learning_method="batch",
        learning_decay=0.7,
        learning_offset=10.0,
        max_iter=10,
        batch_size=128,
        evaluate_every=-1,
        total_samples=1e6,
        perp_tol=1e-1,
        mean_change_tol=1e-3,
        max_doc_update_iter=100,
        n_jobs=None,
        verbose=0,
        random_state=None,
    ):
Import
from sklearn.decomposition import LatentDirichletAllocation
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n_components | int | No | Number of topics (default=10). |
| doc_topic_prior | float | No | Prior of document topic distribution (alpha). Defaults to 1/n_components. |
| topic_word_prior | float | No | Prior of topic word distribution (eta). Defaults to 1/n_components. |
| learning_method | str | No | Method to update components: 'batch' or 'online' (default='batch'). |
| learning_decay | float | No | Controls learning rate in online learning (default=0.7). |
| learning_offset | float | No | Downweights early iterations in online learning (default=10.0). |
| max_iter | int | No | Maximum number of passes over the training data (default=10). |
| batch_size | int | No | Number of documents per EM iteration in online learning (default=128). |
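| evaluate_every | int | No | How often, in passes, to evaluate perplexity during fit; 0 or a negative value disables evaluation (default=-1). |
| total_samples | int | No | Total number of documents; only used by partial_fit (default=1e6). |
| perp_tol | float | No | Perplexity tolerance; only used when evaluate_every is greater than 0 (default=1e-1). |
| mean_change_tol | float | No | Stopping tolerance for updating the document-topic distribution in the E-step (default=1e-3). |
| max_doc_update_iter | int | No | Maximum number of iterations for updating the document-topic distribution in the E-step (default=100). |
| n_jobs | int | No | Number of parallel jobs used in the E-step (default=None). |
| verbose | int | No | Verbosity level (default=0). |
| random_state | int, RandomState instance or None | No | Random state for reproducibility (default=None). |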
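The following sketch illustrates how a few of these parameters interact, assuming made-up Poisson counts: the priors are left at None so their defaults can be inspected after fitting, and evaluate_every is set so that perplexity is checked every second pass and training may stop early once the change falls below perp_tol.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(100, 20))  # made-up counts: 100 documents, 20 terms

lda = LatentDirichletAllocation(
    n_components=5,
    doc_topic_prior=None,    # falls back to 1 / n_components
    topic_word_prior=None,   # falls back to 1 / n_components
    learning_method="batch",
    max_iter=20,
    evaluate_every=2,        # check perplexity every 2 passes
    perp_tol=1e-1,           # stop once perplexity changes by less than this
    random_state=0,
)
lda.fit(X)

# Resolved priors are stored on the fitted estimator.
print("alpha:", lda.doc_topic_prior_, "eta:", lda.topic_word_prior_)
# n_iter_ can be smaller than max_iter if the perplexity check stopped training.
print("passes:", lda.n_iter_)
print("training perplexity:", lda.bound_)
```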
Outputs
| Name | Type | Description |
|---|---|---|
| components_ | ndarray of shape (n_components, n_features) | Variational parameters for topic-word distribution (unnormalized). |
| exp_dirichlet_component_ | ndarray of shape (n_components, n_features) | Exponential of the expectation of log topic-word distribution. |
| n_batch_iter_ | int | Number of mini-batch iterations. |
| n_iter_ | int | Number of passes over the dataset. |
| bound_ | float | Final perplexity score on training set. |
| n_features_in_ | int | Number of features seen during fit. |
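Because components_ is unnormalized, a common follow-up step is to divide each row by its sum to obtain a per-topic distribution over terms. The sketch below does this on random counts; the data and the three-topic setting are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(20, 12))  # made-up counts: 20 documents, 12 terms

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

# components_[k, j] acts like a pseudo-count of term j in topic k.
# Normalizing each row gives a probability distribution over terms.
topic_word = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

print(topic_word.shape)          # one row per topic, one column per term
print(topic_word.sum(axis=1))    # each row sums to 1 up to rounding
print(lda.n_features_in_, lda.n_iter_, lda.n_batch_iter_)
```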
Usage Examples
Basic Usage
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
# Simulate a document-term count matrix (5 documents, 10 terms)
rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(5, 10)).astype(float)
# Fit a 3-topic model and get the document-topic distribution
lda = LatentDirichletAllocation(n_components=3, random_state=0)
X_topics = lda.fit_transform(X)
print(X_topics.shape) # (5, 3); each row sums to 1
print(lda.components_.shape) # (3, 10)
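A fitted model can also be applied to documents it has not seen. The sketch below, which assumes random counts standing in for a training set and three new documents, calls transform on the new rows and picks the highest-weighted topic for each.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.RandomState(0)
X_train = rng.randint(0, 10, size=(50, 10))  # made-up training counts
X_new = rng.randint(0, 10, size=(3, 10))     # made-up unseen documents

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X_train)

# transform infers the document-topic distribution without refitting topics.
new_topics = lda.transform(X_new)

# Index of the highest-weighted topic for each new document.
dominant = new_topics.argmax(axis=1)
print(new_topics.shape, dominant)
```

The new matrix must have the same columns, in the same order, as the training matrix; with text this means reusing the fitted vectorizer rather than fitting a new one.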