Principle:Scikit learn Scikit learn Dimensionality Reduction

Knowledge Sources	Scikit_learn Scikit-learn Docs
Domains	Unsupervised Learning, Feature Engineering
Last Updated	2026-02-08 15:00 GMT

Overview

Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving as much meaningful structure as possible.

Description

Dimensionality reduction techniques address the curse of dimensionality by mapping data from a high-dimensional feature space into a lower-dimensional representation. They mitigate the computational cost, overfitting, and difficulty of visualization that arise when working with many features. These methods can be broadly categorized into linear techniques (which find linear projections onto a subspace) and non-linear techniques (which capture more complex structure). Dimensionality reduction is used in both unsupervised learning and feature engineering pipelines.

Usage

Use dimensionality reduction when the number of features is large relative to the number of samples, when you need to visualize high-dimensional data, or when downstream models suffer from overfitting due to excessive features. PCA is the default first choice for general-purpose linear reduction. Use NMF when data is non-negative and parts-based decomposition is meaningful (e.g., topic modeling, image decomposition). Use ICA when the goal is to recover statistically independent source signals. Use Truncated SVD for sparse data (e.g., text corpora) where centering is impractical.

Theoretical Basis

Principal Component Analysis (PCA) finds orthogonal directions of maximum variance. Given a centered data matrix $X \in \mathbb{R}^{n \times d}$ (one row per sample, one column per feature), PCA computes the eigendecomposition of the covariance matrix:

$Σ = \frac{1}{n - 1} X^{T} X = V Λ V^{T}$

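The $k$ eigenvectors with the largest eigenvalues form the projection matrix $V_{k}$, and the projected data is $Z = X V_{k}$. With the eigenvalues $λ_{i}$ (the diagonal entries of $Λ$) sorted in decreasing order, the fraction of variance retained is $\sum_{i = 1}^{k} λ_{i} / \sum_{i = 1}^{d} λ_{i}$.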
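The eigendecomposition route described above can be checked against scikit-learn's `PCA` estimator. The following is a minimal sketch on synthetic data; the data, the choice of `k = 2`, and the variable names are assumptions made for illustration. Component signs are arbitrary, so the two projections are expected to agree only up to a sign flip per column.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated data (illustrative assumption, not from the source page).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
k = 2

# Manual PCA: center, form the covariance matrix, eigendecompose.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z_manual = Xc @ eigvecs[:, :k]              # Z = X V_k
ratio_manual = eigvals[:k] / eigvals.sum()  # fraction of variance retained

# Same computation through the estimator (it centers the data internally).
pca = PCA(n_components=k, svd_solver="full")
Z_sklearn = pca.fit_transform(X)

print(ratio_manual, pca.explained_variance_ratio_)
```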

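Singular Value Decomposition (SVD) decomposes $X = U Σ V^{T}$, where $Σ$ here is the diagonal matrix of singular values (not the covariance matrix from the PCA formula). Truncated SVD retains only the top $k$ singular values, yielding the best rank-$k$ approximation in the Frobenius norm. This is especially useful for sparse matrices because, unlike PCA, it does not center the data, and centering would turn a sparse matrix dense.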
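As a rough illustration, `TruncatedSVD` can be applied directly to a SciPy sparse matrix without densifying it. The matrix below is random and purely illustrative. The `algorithm="arpack"` setting is chosen here only so the result is easy to compare with a dense SVD; the estimator's default is a randomized solver.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Build a small sparse matrix (about 5% non-zeros); illustrative data only.
rng = np.random.default_rng(0)
dense = rng.random((100, 50))
dense[rng.random((100, 50)) > 0.05] = 0.0
X_sparse = csr_matrix(dense)

# Keep the top 5 singular triplets; no centering is applied.
svd = TruncatedSVD(n_components=5, algorithm="arpack", random_state=0)
Z = svd.fit_transform(X_sparse)   # rows of X projected onto the top 5 right singular vectors

# Reference: the leading singular values from a full dense SVD.
reference = np.linalg.svd(dense, compute_uv=False)[:5]
print(svd.singular_values_, reference)
```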

Non-negative Matrix Factorization (NMF) approximates $X \approx W H$ where $X, W, H \geq 0$ . The non-negativity constraint produces additive, parts-based representations. The objective minimizes:

$‖ X - W H ‖_{F}^{2}$

subject to $W \geq 0, H \geq 0$ .
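A minimal sketch of this factorization with scikit-learn's `NMF` follows. The synthetic non-negative matrix, the rank of 3, and the `nndsvda` initialization are assumptions made for the example. With the default Frobenius loss, the fitted `reconstruction_err_` should correspond to $‖ X - W H ‖_{F}$ (the norm, not its square).

```python
import numpy as np
from sklearn.decomposition import NMF

# Non-negative data generated from a known rank-3 factorization (illustrative).
rng = np.random.default_rng(0)
X = rng.random((60, 3)) @ rng.random((3, 20))

nmf = NMF(n_components=3, init="nndsvda", max_iter=1000, random_state=0)
W = nmf.fit_transform(X)   # shape (n_samples, n_components), non-negative
H = nmf.components_        # shape (n_components, n_features), non-negative

# Frobenius norm of the residual, computed by hand for comparison.
frob_error = np.linalg.norm(X - W @ H)
print(frob_error, nmf.reconstruction_err_)
```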

Independent Component Analysis (ICA) assumes the observed data is a linear mixture of statistically independent sources: $X = A S$ . FastICA recovers the unmixing matrix $W$ such that $S = W X$ by maximizing the non-Gaussianity of the recovered components.
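The sketch below mixes two synthetic signals and asks `FastICA` to separate them. The signals and the mixing matrix are invented for illustration. Note that scikit-learn stores samples as rows, so the mixing model reads `X = S @ A.T` in code rather than $X = A S$. ICA recovers sources only up to permutation, sign, and scale, so the comparison uses absolute correlations.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two non-Gaussian sources: a sine wave and a square wave (illustrative).
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S_true = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
S_true += 0.02 * rng.normal(size=S_true.shape)

# Hypothetical mixing matrix A; observations are linear mixtures of the sources.
A = np.array([[1.0, 0.5], [0.5, 2.0]])
X = S_true @ A.T

ica = FastICA(n_components=2, random_state=0, max_iter=1000)
S_est = ica.fit_transform(X)            # estimated sources, one column each
X_back = ica.inverse_transform(S_est)   # map the sources back to mixtures

# Absolute correlation between each true source and each estimated source.
corr = np.abs(np.corrcoef(S_true.T, S_est.T)[:2, 2:])
print(corr.round(3))
```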

Incremental PCA processes data in mini-batches, enabling PCA on datasets that do not fit in memory.
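One way to sketch the mini-batch workflow is to call `partial_fit` on `IncrementalPCA` once per chunk. Here the chunks come from an in-memory array split into ten pieces, which stands in for batches that would normally be read from disk; the data and the batch count are assumptions.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Stand-in for a dataset that would normally be streamed from disk.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))

ipca = IncrementalPCA(n_components=3)
for batch in np.array_split(X, 10):   # ten batches of 100 rows each
    ipca.partial_fit(batch)           # update the components with this batch only

Z = ipca.transform(X)                 # project using the accumulated components
print(ipca.n_samples_seen_, ipca.explained_variance_ratio_)
```

Each batch needs at least `n_components` rows, which is why the batch size here is kept well above 3.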

Kernel PCA applies the kernel trick to perform PCA in a high-dimensional feature space implicitly defined by a kernel function $k (x_{i}, x_{j})$ , capturing non-linear structure.
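As an illustrative sketch, `KernelPCA` with an RBF kernel can be applied to concentric circles, a dataset with structure that a linear projection cannot untangle. The dataset parameters and `gamma=10.0` are assumptions chosen for the example, not recommended defaults; in practice the kernel and its parameters usually need tuning.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

# RBF kernel k(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2); gamma is an assumed value.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
Z = kpca.fit_transform(X)

# Compare the class means along the first kernel principal component.
print(Z[y == 0, 0].mean(), Z[y == 1, 0].mean())
```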

Sparse PCA adds an $ℓ_{1}$ penalty to the components to produce sparse loadings, yielding more interpretable principal components.
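A small sketch with scikit-learn's `SparsePCA` is shown below. The random data and the penalty strength `alpha=1.0` are assumptions for illustration. Larger `alpha` values generally push more loadings to exactly zero, but how many depends on the data.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Illustrative data: 100 samples, 12 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))

# alpha controls the strength of the l1 penalty on the components.
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0)
Z = spca.fit_transform(X)

# Count loadings that were driven exactly to zero by the penalty.
n_zero = int(np.sum(spca.components_ == 0))
print(f"{n_zero} of {spca.components_.size} loadings are exactly zero")
```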

Dictionary Learning finds a sparse representation of data in terms of an overcomplete basis (dictionary), minimizing $‖ X - D A ‖_{F}^{2} + α ‖ A ‖_{1}$ .
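The following sketch fits `DictionaryLearning` with more atoms than features, which is what makes the dictionary overcomplete. The data, the atom count, and the `alpha` values are assumptions. scikit-learn uses the transposed convention relative to the formula above: samples are rows, so the reconstruction is `codes @ D` with one atom per row of `D`.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Illustrative data: 50 samples with 8 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))

# 12 atoms for 8 features gives an overcomplete dictionary.
dl = DictionaryLearning(
    n_components=12,
    alpha=1.0,                       # sparsity penalty used while learning
    max_iter=50,
    transform_algorithm="lasso_lars",
    transform_alpha=1.0,             # sparsity penalty used when encoding
    random_state=0,
)
codes = dl.fit_transform(X)          # sparse codes, shape (n_samples, n_atoms)
D = dl.components_                   # dictionary atoms, shape (n_atoms, n_features)

sparsity = float(np.mean(codes == 0))
print(f"fraction of zero coefficients: {sparsity:.2f}")
```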

Factor Analysis models observed variables as linear combinations of latent factors plus Gaussian noise with a diagonal covariance, distinguishing shared variance from variable-specific variance.
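To make the shared versus variable-specific split concrete, the sketch below generates data from two latent factors plus per-feature Gaussian noise and fits `FactorAnalysis`. The generating parameters are assumptions. The fitted `noise_variance_` holds one estimated noise variance per feature, matching the diagonal noise covariance in the model.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Generate data: latent factors times loadings, plus feature-specific noise.
rng = np.random.default_rng(0)
n, d, k = 500, 6, 2
factors = rng.normal(size=(n, k))
loadings = rng.normal(size=(k, d))
noise_std = np.linspace(0.1, 1.0, d)       # a different noise level per feature
X = factors @ loadings + rng.normal(size=(n, d)) * noise_std

fa = FactorAnalysis(n_components=k, random_state=0)
Z = fa.fit_transform(X)                    # estimated latent factors per sample

# Diagonal noise variances (variable-specific) and the model's full covariance.
print(fa.noise_variance_.round(2))
model_cov = fa.get_covariance()
```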

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data (e.g., text corpora) that represents each document as a mixture of topics and each topic as a distribution over words.
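A toy sketch with `LatentDirichletAllocation` on word counts follows. The six short documents and the choice of two topics are invented for illustration, and a corpus this small will not produce meaningful topics. The point is the shape of the output: one topic mixture per document and one word distribution per topic.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (hypothetical documents).
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "my dog chased the cat",
    "stocks fell as markets closed",
    "investors sold stocks and bonds",
    "the market rallied on earnings",
]

# LDA models discrete counts, so start from a bag-of-words matrix.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # per-document topic mixture
row_sums = doc_topics.sum(axis=1)        # each row is a distribution over topics

print(doc_topics.round(2))
```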
