Principle: RAPIDS cuML Dimensionality Reduction Algorithm Selection
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Dimensionality_Reduction |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Dimensionality reduction algorithm selection is the process of choosing between linear methods (PCA, SVD) and nonlinear methods (UMAP, t-SNE) based on the structure of the data, the goals of the analysis, and computational constraints.
Description
High-dimensional datasets often contain redundant or correlated features that can be compressed into a lower-dimensional representation without significant loss of information. The choice of dimensionality reduction algorithm depends on several factors:
Linear methods such as PCA (Principal Component Analysis) assume that the most important variation in the data can be captured by linear combinations of the original features. PCA performs an eigendecomposition of the covariance matrix (or equivalently, a Singular Value Decomposition of the data matrix) to find orthogonal directions of maximum variance. These methods are fast, deterministic, and invertible, making them suitable for data compression, noise removal, and preprocessing pipelines.
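As a sketch of the invertibility and compression properties, the following uses scikit-learn's `PCA` on synthetic data (cuML's `PCA` mirrors this estimator API, so the same pattern applies on GPU; the data shapes and noise level here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples lying near a 3-dimensional linear subspace of R^10.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=3)
Z = pca.fit_transform(X)          # compress: (200, 10) -> (200, 3)
X_rec = pca.inverse_transform(Z)  # reconstruct back to (200, 10)

# Because the data is nearly 3-dimensional, the reconstruction error is tiny
# and the top 3 components explain almost all of the variance.
err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
var = pca.explained_variance_ratio_.sum()
```

The same round trip with UMAP or t-SNE is not available in general: those embeddings are not linear maps and (for t-SNE) cannot even project new points.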
Nonlinear methods capture complex manifold structure that linear methods miss:
- UMAP (Uniform Manifold Approximation and Projection) is grounded in topological data analysis. It constructs a fuzzy simplicial set representation of the high-dimensional data's local neighborhood structure, then optimizes a low-dimensional layout that preserves this topological representation. UMAP balances local and global structure preservation and supports out-of-sample transforms.
- t-SNE (t-Distributed Stochastic Neighbor Embedding) converts pairwise distances into conditional probabilities in both high-dimensional and low-dimensional spaces, then minimizes the Kullback-Leibler (KL) divergence between these distributions. It excels at revealing cluster structure for visualization but is transductive (cannot project new points) and does not preserve global distances.
Usage
Use this principle when deciding which dimensionality reduction algorithm to apply:
- Choose PCA when you need a fast, deterministic, invertible linear projection, when explained variance is a meaningful metric, or when the data has approximately linear structure. Common use cases include data compression, denoising, and feature extraction for downstream models.
- Choose UMAP when you need to preserve both local and global manifold structure, when you require out-of-sample projection, or when working with large datasets where performance matters. UMAP is suitable for both visualization and as a general-purpose embedding.
- Choose t-SNE when the primary goal is 2D/3D visualization of cluster structure and local neighborhoods. t-SNE is particularly effective at revealing well-separated clusters but should not be used when out-of-sample projection is needed.
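The inductive/transductive split above is visible directly in the scikit-learn estimator API, which cuML mirrors: `PCA` exposes a `transform` method for unseen points, while `TSNE` deliberately does not (a small check, assuming scikit-learn is installed):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns an explicit linear map, so new points can be projected later.
pca_inductive = hasattr(PCA(), "transform")
# t-SNE only embeds the points it was fit on: it offers fit_transform
# but no transform method for out-of-sample data.
tsne_inductive = hasattr(TSNE(), "transform")
```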
Theoretical Basis
PCA: Eigendecomposition of the Covariance Matrix
Given a mean-centered data matrix $X \in \mathbb{R}^{n \times d}$, PCA finds the directions of maximum variance by solving:

$$C w_i = \lambda_i w_i, \qquad C = \frac{1}{n-1} X^\top X$$

where $C$ is the covariance matrix, $w_i$ are the principal components (eigenvectors), and $\lambda_i$ are the explained variances (eigenvalues). The top $k$ components $W_k = [w_1, \ldots, w_k]$ define the projection $Z = X W_k$.
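The equivalence between eigendecomposition of the covariance matrix and SVD of the centered data matrix can be checked numerically in a few lines of numpy (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                 # mean-center

# Route 1: eigendecomposition of the covariance matrix, C w_i = lambda_i w_i.
C = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix, Xc = U S V^T.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S**2 / (Xc.shape[0] - 1)     # singular values -> eigenvalues

# Both routes give the same explained variances and the same
# principal directions (up to sign).
```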
UMAP: Fuzzy Simplicial Sets
UMAP constructs a weighted k-neighbor graph in high-dimensional space, where edge weights represent the probability that two points are connected. The algorithm:
- Computes the $k$ nearest neighbors of each point $x_i$, the local connectivity distance $\rho_i$ (distance to the nearest neighbor), and a local bandwidth $\sigma_i$.
- Constructs a fuzzy simplicial set by combining local metric spaces using fuzzy set union.
- Optimizes a low-dimensional embedding by minimizing the cross-entropy between the high-dimensional fuzzy simplicial set (edge weights $\mu_{ij}$) and its low-dimensional counterpart (edge weights $\nu_{ij}$):

$$C = \sum_{i \neq j} \left[ \mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log \frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right]$$
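The cross-entropy objective can be sketched numerically. Here `mu` and `nu` stand in for the high- and low-dimensional edge weights; the values are illustrative, not produced by a real UMAP run:

```python
import numpy as np

def fuzzy_set_cross_entropy(mu, nu, eps=1e-12):
    """Cross-entropy between two sets of fuzzy edge weights in (0, 1)."""
    mu = np.clip(mu, eps, 1 - eps)
    nu = np.clip(nu, eps, 1 - eps)
    attract = mu * np.log(mu / nu)                    # pulls connected points together
    repel = (1 - mu) * np.log((1 - mu) / (1 - nu))    # pushes unconnected points apart
    return float(np.sum(attract + repel))

mu = np.array([0.9, 0.5, 0.1])   # high-dimensional membership strengths
nu = np.array([0.9, 0.5, 0.1])   # a perfectly matching embedding
# Matching weights give zero cost; mismatched weights give positive cost,
# which is what UMAP's optimizer drives down.
```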
t-SNE: KL Divergence Minimization
t-SNE converts pairwise Euclidean distances into joint probabilities $p_{ij}$ in high-dimensional space (Gaussian kernel) and $q_{ij}$ in low-dimensional space (Student's t-distribution with one degree of freedom):

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$$

The embedding $\{y_i\}$ is found by minimizing the KL divergence:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
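The objective above can be sketched in numpy for a fixed candidate embedding. This is a simplified version: it uses one global bandwidth `sigma` instead of the per-point $\sigma_i$ that real t-SNE tunes via perplexity, and it evaluates the cost rather than optimizing it:

```python
import numpy as np

def tsne_kl(X, Y, sigma=1.0):
    """KL divergence between high-dim Gaussian affinities P and
    low-dim Student-t affinities Q, for a fixed embedding Y."""
    n = X.shape[0]
    d2_hi = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    d2_lo = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)

    # Conditional p_{j|i} with a Gaussian kernel, zero on the diagonal,
    # then symmetrized to the joint p_{ij}.
    P = np.exp(-d2_hi / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)
    P = (P + P.T) / (2 * n)

    # Joint q_{ij} with a Student-t kernel (one degree of freedom).
    Q = 1.0 / (1.0 + d2_lo)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
kl = tsne_kl(X, X[:, :2])   # crude 2-D "embedding": the first two coordinates
```

An optimizer (gradient descent in real t-SNE) would move the rows of `Y` to drive this cost down; it is always non-negative and reaches zero only when `Q` matches `P`.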
Algorithm Selection Decision Pseudocode
```python
def select_algorithm(goal, need_out_of_sample_transform=False, dataset_size=0):
    """Return the name of a suitable dimensionality reduction algorithm."""
    if goal in ("compression", "preprocessing"):
        return "PCA"           # fast, invertible, deterministic
    elif goal == "visualization":
        if need_out_of_sample_transform:
            return "UMAP"      # t-SNE cannot project new points
        elif dataset_size > 50_000:
            return "UMAP"      # scales better than t-SNE
        else:
            return "TSNE"      # best cluster separation for visualization
    elif goal == "manifold_learning":
        # Both UMAP and t-SNE capture nonlinear structure; UMAP is the
        # general-purpose default, and the only option when new data must
        # be transformed after fitting.
        return "UMAP"
    raise ValueError(f"unknown goal: {goal!r}")
```