Principle: RAPIDS cuML Dimensionality Reduction Algorithm Selection
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Dimensionality_Reduction |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Dimensionality reduction algorithm selection is the process of choosing between linear methods (PCA, SVD) and nonlinear methods (UMAP, t-SNE) based on the structure of the data, the goals of the analysis, and computational constraints.
Description
High-dimensional datasets often contain redundant or correlated features that can be compressed into a lower-dimensional representation without significant loss of information. The choice of dimensionality reduction algorithm depends on several factors:
Linear methods such as PCA (Principal Component Analysis) assume that the most important variation in the data can be captured by linear combinations of the original features. PCA performs an eigendecomposition of the covariance matrix (or equivalently, a Singular Value Decomposition of the data matrix) to find orthogonal directions of maximum variance. These methods are fast, deterministic, and invertible, making them suitable for data compression, noise removal, and preprocessing pipelines.
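As a sketch of the invertibility and compression properties, the following uses scikit-learn's `PCA` on synthetic data (cuML's `PCA` mirrors this estimator API, so the same pattern applies on GPU; the data shapes and noise level here are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples lying near a 3-dimensional linear subspace of R^10.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

pca = PCA(n_components=3)
Z = pca.fit_transform(X)          # compress: (200, 10) -> (200, 3)
X_rec = pca.inverse_transform(Z)  # reconstruct back to (200, 10)

# Because the data is nearly 3-dimensional, the reconstruction error is tiny
# and the top 3 components explain almost all of the variance.
err = np.linalg.norm(X - X_rec) / np.linalg.norm(X)
var = pca.explained_variance_ratio_.sum()
```

The same round trip with UMAP or t-SNE is not available in general: those embeddings are not linear maps and (for t-SNE) cannot even project new points.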
Nonlinear methods capture complex manifold structure that linear methods miss:
- UMAP (Uniform Manifold Approximation and Projection) is grounded in topological data analysis. It constructs a fuzzy simplicial set representation of the high-dimensional data's local neighborhood structure, then optimizes a low-dimensional layout that preserves this topological representation. UMAP balances local and global structure preservation and supports out-of-sample transforms.
- t-SNE (t-Distributed Stochastic Neighbor Embedding) converts pairwise distances into conditional probabilities in both high-dimensional and low-dimensional spaces, then minimizes the Kullback-Leibler (KL) divergence between these distributions. It excels at revealing cluster structure for visualization but is transductive (cannot project new points) and does not preserve global distances.
Usage
Use this principle when deciding which dimensionality reduction algorithm to apply:
- Choose PCA when you need a fast, deterministic, invertible linear projection, when explained variance is a meaningful metric, or when the data has approximately linear structure. Common use cases include data compression, denoising, and feature extraction for downstream models.
- Choose UMAP when you need to preserve both local and global manifold structure, when you require out-of-sample projection, or when working with large datasets where performance matters. UMAP is suitable for both visualization and as a general-purpose embedding.
- Choose t-SNE when the primary goal is 2D/3D visualization of cluster structure and local neighborhoods. t-SNE is particularly effective at revealing well-separated clusters but should not be used when out-of-sample projection is needed.
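The inductive/transductive split above is visible directly in the scikit-learn estimator API, which cuML mirrors: `PCA` exposes a `transform` method for unseen points, while `TSNE` deliberately does not (a small check, assuming scikit-learn is installed):

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns an explicit linear map, so new points can be projected later.
pca_inductive = hasattr(PCA(), "transform")
# t-SNE only embeds the points it was fit on: it offers fit_transform
# but no transform method for out-of-sample data.
tsne_inductive = hasattr(TSNE(), "transform")
```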
Theoretical Basis
PCA: Eigendecomposition of the Covariance Matrix
Given a mean-centered data matrix $X \in \mathbb{R}^{n \times d}$, PCA finds the directions of maximum variance by solving:

$$C w_i = \lambda_i w_i, \qquad C = \frac{1}{n-1} X^\top X$$

where $C$ is the covariance matrix, $w_i$ are the principal components (eigenvectors), and $\lambda_i$ are the explained variances (eigenvalues). The top $k$ components $W_k = [w_1, \ldots, w_k]$ define the projection $Z = X W_k$.
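The equivalence between eigendecomposition of the covariance matrix and SVD of the centered data matrix can be checked numerically in a few lines of numpy (random data, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                 # mean-center

# Route 1: eigendecomposition of the covariance matrix, C w_i = lambda_i w_i.
C = Xc.T @ Xc / (Xc.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]       # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: SVD of the centered data matrix, Xc = U S V^T.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = S**2 / (Xc.shape[0] - 1)     # singular values -> eigenvalues

# Both routes give the same explained variances and the same
# principal directions (up to sign).
```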
UMAP: Fuzzy Simplicial Sets
UMAP constructs a weighted k-neighbor graph in high-dimensional space, where edge weights represent the probability that two points are connected. The algorithm:
- Computes the $k$ nearest neighbors of each point $x_i$, the local connectivity distance $\rho_i$ (distance to the nearest neighbor), and a local bandwidth $\sigma_i$.
- Constructs a fuzzy simplicial set by combining local metric spaces using fuzzy set union.
- Optimizes a low-dimensional embedding by minimizing the cross-entropy between the high-dimensional fuzzy simplicial set (edge weights $\mu_{ij}$) and its low-dimensional counterpart (edge weights $\nu_{ij}$):

$$C = \sum_{i \neq j} \left[ \mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + (1 - \mu_{ij}) \log \frac{1 - \mu_{ij}}{1 - \nu_{ij}} \right]$$
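The cross-entropy objective can be sketched numerically. Here `mu` and `nu` stand in for the high- and low-dimensional edge weights; the values are illustrative, not produced by a real UMAP run:

```python
import numpy as np

def fuzzy_set_cross_entropy(mu, nu, eps=1e-12):
    """Cross-entropy between two sets of fuzzy edge weights in (0, 1)."""
    mu = np.clip(mu, eps, 1 - eps)
    nu = np.clip(nu, eps, 1 - eps)
    attract = mu * np.log(mu / nu)                    # pulls connected points together
    repel = (1 - mu) * np.log((1 - mu) / (1 - nu))    # pushes unconnected points apart
    return float(np.sum(attract + repel))

mu = np.array([0.9, 0.5, 0.1])   # high-dimensional membership strengths
nu = np.array([0.9, 0.5, 0.1])   # a perfectly matching embedding
# Matching weights give zero cost; mismatched weights give positive cost,
# which is what UMAP's optimizer drives down.
```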
t-SNE: KL Divergence Minimization
t-SNE converts pairwise Euclidean distances into joint probabilities $p_{ij}$ in high-dimensional space (Gaussian kernel) and $q_{ij}$ in low-dimensional space (Student's t-distribution with one degree of freedom):

$$p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}, \qquad q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}$$

The embedding $\{y_i\}$ is found by minimizing the KL divergence:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
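The objective above can be sketched in numpy for a fixed candidate embedding. This is a simplified version: it uses one global bandwidth `sigma` instead of the per-point $\sigma_i$ that real t-SNE tunes via perplexity, and it evaluates the cost rather than optimizing it:

```python
import numpy as np

def tsne_kl(X, Y, sigma=1.0):
    """KL divergence between high-dim Gaussian affinities P and
    low-dim Student-t affinities Q, for a fixed embedding Y."""
    n = X.shape[0]
    d2_hi = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    d2_lo = np.sum((Y[:, None] - Y[None, :]) ** 2, axis=-1)

    # Conditional p_{j|i} with a Gaussian kernel, zero on the diagonal,
    # then symmetrized to the joint p_{ij}.
    P = np.exp(-d2_hi / (2 * sigma**2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum(axis=1, keepdims=True)
    P = (P + P.T) / (2 * n)

    # Joint q_{ij} with a Student-t kernel (one degree of freedom).
    Q = 1.0 / (1.0 + d2_lo)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))
kl = tsne_kl(X, X[:, :2])   # crude 2-D "embedding": the first two coordinates
```

An optimizer (gradient descent in real t-SNE) would move the rows of `Y` to drive this cost down; it is always non-negative and reaches zero only when `Q` matches `P`.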
Algorithm Selection Decision Pseudocode
```python
def select_algorithm(goal, need_out_of_sample_transform=False, dataset_size=0):
    """Return the name of a suitable dimensionality reduction algorithm."""
    if goal in ("compression", "preprocessing"):
        return "PCA"           # fast, invertible, deterministic
    elif goal == "visualization":
        if need_out_of_sample_transform:
            return "UMAP"      # t-SNE cannot project new points
        elif dataset_size > 50_000:
            return "UMAP"      # scales better than t-SNE
        else:
            return "TSNE"      # best cluster separation for visualization
    elif goal == "manifold_learning":
        # Both UMAP and t-SNE capture nonlinear structure; UMAP is the
        # general-purpose default, and the only option when new data must
        # be transformed after fitting.
        return "UMAP"
    raise ValueError(f"unknown goal: {goal!r}")
```