Implementation:Scikit learn Scikit learn TruncatedSVD
| Knowledge Sources | |
|---|---|
| Domains | Dimensionality Reduction, Natural Language Processing |
| Last Updated | 2026-02-08 15:00 GMT |
Overview
A concrete tool for dimensionality reduction via truncated SVD, provided by scikit-learn. When applied to term-document matrices, the technique is known as Latent Semantic Analysis (LSA).
Description
TruncatedSVD performs linear dimensionality reduction by means of truncated singular value decomposition (SVD). Unlike PCA, it does not center the data before computing the SVD, so it can operate on sparse matrices directly without densifying them. It supports two algorithms: a fast randomized SVD solver (the default) and a "naive" solver that uses ARPACK as an eigensolver on X * X.T or X.T * X, whichever is more efficient. When applied to term count or tf-idf matrices in text analysis, truncated SVD is known as Latent Semantic Analysis (LSA).
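To make the no-centering point concrete, the following minimal sketch fits TruncatedSVD on a SciPy CSR matrix and checks that its singular values agree with a full SVD of the same uncentered matrix. The toy data, matrix sizes, and variable names are assumptions chosen for illustration; they do not come from the scikit-learn documentation.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy data (assumption): a 20 x 8 matrix with roughly 70% zeros.
rng = np.random.RandomState(0)
X_dense = rng.rand(20, 8)
X_dense[X_dense < 0.7] = 0.0
X_sparse = sparse.csr_matrix(X_dense)

# The sparse matrix is passed in directly; no centering step is applied.
svd = TruncatedSVD(n_components=3, algorithm="arpack", random_state=0)
X_reduced = svd.fit_transform(X_sparse)

# Reference: singular values from a full SVD of the same uncentered matrix.
s_full = np.linalg.svd(X_dense, compute_uv=False)
singular_values_match = np.allclose(svd.singular_values_, s_full[:3])
print(X_reduced.shape, singular_values_match)
```

The ARPACK solver is used here only because it makes the comparison with the dense reference tight; the default randomized solver would be expected to agree approximately rather than to machine precision.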
Usage
Use TruncatedSVD when working with sparse data, particularly term-frequency or tf-idf matrices from text processing pipelines. It is the standard technique for Latent Semantic Analysis (LSA) in information retrieval, document similarity, and text classification. It is also useful as a general-purpose dimensionality reduction tool when the input is sparse and centering would destroy sparsity.
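A document-similarity workflow of this kind might look like the sketch below. It chains tf-idf, TruncatedSVD, and L2 normalization in a pipeline so that dot products between the reduced vectors are cosine similarities. The corpus, the choice of two components, and the names lsa, X_lsa, and similarity are illustrative assumptions; a real LSA model would use far more documents and components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Toy corpus (assumption); every document shares at least one term.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the dog and the cat are friends",
]

# tf-idf -> truncated SVD -> L2 normalization of each document vector.
lsa = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    Normalizer(copy=False),
)
X_lsa = lsa.fit_transform(docs)

# Rows have unit length, so dot products are cosine similarities.
similarity = X_lsa @ X_lsa.T
print(similarity.round(2))
```

Normalizing after the SVD step is a design choice rather than a requirement: it makes the similarity scores independent of document length in the reduced space.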
Code Reference
Source Location
- Repository: scikit-learn
- File: sklearn/decomposition/_truncated_svd.py
Signature
class TruncatedSVD(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
    def __init__(
        self,
        n_components=2,
        *,
        algorithm="randomized",
        n_iter=5,
        n_oversamples=10,
        power_iteration_normalizer="auto",
        random_state=None,
        tol=0.0,
    ):
Import
from sklearn.decomposition import TruncatedSVD
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
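| n_components | int | No | Desired dimensionality of output data (default=2). Must be strictly less than the number of features for 'arpack', and at most the number of features for 'randomized'. The default is useful for visualization; for LSA, a value of 100 is recommended. |
| algorithm | str | No | SVD solver: 'arpack' or 'randomized' (default='randomized'). |
| n_iter | int | No | Number of iterations for the randomized SVD solver (default=5). Not used by ARPACK. |
| n_oversamples | int | No | Number of oversamples for the randomized SVD solver (default=10). Not used by ARPACK. |
| power_iteration_normalizer | str | No | Power iteration normalizer for the randomized SVD solver: 'auto', 'QR', 'LU', or 'none' (default='auto'). Not used by ARPACK. |
| random_state | int, RandomState instance, or None | No | Controls the randomness used during fitting; pass an int for reproducible results (default=None). |
| tol | float | No | Tolerance for ARPACK (default=0.0, meaning machine precision). Ignored by the randomized SVD solver. |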
Outputs
| Name | Type | Description |
|---|---|---|
| components_ | ndarray of shape (n_components, n_features) | The right singular vectors of the input data (the V^T in X = U S V^T). |
| explained_variance_ | ndarray of shape (n_components,) | Variance of the training data projected onto each component. |
| explained_variance_ratio_ | ndarray of shape (n_components,) | Fraction of the total variance explained by each selected component. |
| singular_values_ | ndarray of shape (n_components,) | Singular values corresponding to each selected component; they equal the 2-norms of the n_components variables in the lower-dimensional space. |
| n_features_in_ | int | Number of features seen during fit. |
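The relationships between these fitted attributes can be checked numerically. The sketch below is one way to do so, assuming a small random sparse matrix and the ARPACK solver so that the equalities hold tightly; the variable names are hypothetical.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy sparse input (assumption): 50 samples, 12 features.
X = sparse.random(50, 12, density=0.3, format="csr", random_state=1)
svd = TruncatedSVD(n_components=4, algorithm="arpack", random_state=1)
X_t = svd.fit_transform(X)

# components_ has one row per component, and the rows are orthonormal.
rows_orthonormal = np.allclose(svd.components_ @ svd.components_.T, np.eye(4))

# singular_values_ equal the 2-norms of the columns of the transformed data.
norms_match = np.allclose(np.linalg.norm(X_t, axis=0), svd.singular_values_)

# explained_variance_ is the per-column variance of the transformed data.
variance_match = np.allclose(X_t.var(axis=0), svd.explained_variance_)

# explained_variance_ratio_ divides by the summed variance of the input features.
total_var = X.toarray().var(axis=0).sum()
ratio_match = np.allclose(
    svd.explained_variance_ / total_var, svd.explained_variance_ratio_
)
print(rows_orthonormal, norms_match, variance_match, ratio_match)
```

Because the data are not centered, the variance of a projected column is not simply the squared singular value divided by the number of samples, which is why the variance and the singular values are reported as separate attributes.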
Usage Examples
Basic Usage
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

# Build a sparse tf-idf matrix of shape (n_documents, n_terms).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Project the documents onto the top two singular directions.
svd = TruncatedSVD(n_components=2, random_state=42)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)  # (3, 2)
print(svd.explained_variance_ratio_)