
Principle:Rapidsai Cuml Dimensionality Reduction Algorithm Selection

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Dimensionality_Reduction
Last Updated 2026-02-08 00:00 GMT

Overview

Dimensionality reduction algorithm selection is the process of choosing between linear methods (PCA, SVD) and nonlinear methods (UMAP, t-SNE) based on the structure of the data, the goals of the analysis, and computational constraints.

Description

High-dimensional datasets often contain redundant or correlated features that can be compressed into a lower-dimensional representation without significant loss of information. The choice of dimensionality reduction algorithm depends on several factors:

Linear methods such as PCA (Principal Component Analysis) assume that the most important variation in the data can be captured by linear combinations of the original features. PCA performs an eigendecomposition of the covariance matrix (or equivalently, a Singular Value Decomposition of the data matrix) to find orthogonal directions of maximum variance. These methods are fast, deterministic, and invertible, making them suitable for data compression, noise removal, and preprocessing pipelines.
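As a concrete illustration, a minimal PCA can be written with an SVD in NumPy. This is a sketch on toy data, not cuML's implementation; the variable names are ours:

```python
import numpy as np

# Toy dataset with correlated features (illustrative, not from the page).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)                      # mean-center the data

# SVD of the centered data matrix gives the principal directions (rows of Vt).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                            # project onto top-k components (compression)
X_hat = Z @ Vt[:k] + X.mean(axis=0)          # invert the projection (reconstruction)

explained = s**2 / np.sum(s**2)              # fraction of variance captured per component
```

Because the projection is linear and orthogonal, `X_hat` is the best rank-k reconstruction of `X` in the least-squares sense, which is what makes PCA useful for compression and denoising.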

Nonlinear methods capture complex manifold structure that linear methods miss:

  • UMAP (Uniform Manifold Approximation and Projection) is grounded in topological data analysis. It constructs a fuzzy simplicial set representation of the high-dimensional data's local neighborhood structure, then optimizes a low-dimensional layout that preserves this topological representation. UMAP balances local and global structure preservation and supports out-of-sample transforms.
  • t-SNE (t-Distributed Stochastic Neighbor Embedding) converts pairwise distances into conditional probabilities in both high-dimensional and low-dimensional spaces, then minimizes the Kullback-Leibler (KL) divergence between these distributions. It excels at revealing cluster structure for visualization but is transductive (cannot project new points) and does not preserve global distances.
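As a rough illustration of how UMAP turns neighbor distances into graph weights, the local membership strength \(w = \exp(-\max(0, d - \rho)/\sigma)\) can be sketched for a single point. The distances and bandwidth below are made up; real values come from the k-nearest-neighbor search:

```python
import numpy as np

# Distances from one point to its k nearest neighbors (illustrative values).
d = np.array([0.5, 0.8, 1.2, 2.0])
rho = d.min()      # distance to the nearest neighbor: guarantees one fully-connected edge
sigma = 0.6        # bandwidth; UMAP actually tunes sigma_i so the weights sum to log2(k)
w = np.exp(-np.maximum(0.0, d - rho) / sigma)   # membership strength of each edge

# Fuzzy set union used to symmetrize a directed pair of weights w_ij and w_ji.
w_ij, w_ji = 0.9, 0.4
w_sym = w_ij + w_ji - w_ij * w_ji
```

The nearest neighbor always receives weight 1, and weights decay smoothly with distance, which is the "fuzzy" part of the fuzzy simplicial set construction.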

Usage

Use this principle when deciding which dimensionality reduction algorithm to apply:

  • Choose PCA when you need a fast, deterministic, invertible linear projection, when explained variance is a meaningful metric, or when the data has approximately linear structure. Common use cases include data compression, denoising, and feature extraction for downstream models.
  • Choose UMAP when you need to preserve both local and global manifold structure, when you require out-of-sample projection, or when working with large datasets where performance matters. UMAP is suitable for both visualization and as a general-purpose embedding.
  • Choose t-SNE when the primary goal is 2D/3D visualization of cluster structure and local neighborhoods. t-SNE is particularly effective at revealing well-separated clusters but should not be used when out-of-sample projection is needed.

Theoretical Basis

PCA: Eigendecomposition of the Covariance Matrix

Given a mean-centered data matrix \(X \in \mathbb{R}^{n \times d}\), PCA finds the directions of maximum variance by solving:

\[\Sigma = \frac{1}{n-1} X^\top X\]

\[\Sigma v_i = \lambda_i v_i\]

where \(\Sigma\) is the covariance matrix, the \(v_i\) are the principal components (eigenvectors), and the \(\lambda_i\) are the explained variances (eigenvalues). The top \(k\) eigenvectors form \(V_k\), and the projection is \(Z = X V_k\).
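The equivalence between the eigendecomposition and the SVD can be checked numerically: the singular values \(s_i\) of the centered data satisfy \(\lambda_i = s_i^2/(n-1)\). A sketch on toy data:

```python
import numpy as np

# Toy data; names are illustrative.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix Sigma = (1/(n-1)) X^T X.
cov = Xc.T @ Xc / (n - 1)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending order

# Route 2: SVD of the centered data; lambda_i = s_i^2 / (n-1).
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (n - 1)
```

Both routes yield the same explained variances; implementations typically prefer the SVD because it avoids forming \(X^\top X\) explicitly and is numerically more stable.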

UMAP: Fuzzy Simplicial Sets

UMAP constructs a weighted k-neighbor graph in high-dimensional space, where edge weights represent the probability that two points are connected. The algorithm:

  1. Computes k-nearest neighbors and, for each point, the local connectivity distance \(\rho_i\) and bandwidth \(\sigma_i\).
  2. Constructs a fuzzy simplicial set by combining local metric spaces using fuzzy set union.
  3. Optimizes a low-dimensional embedding by minimizing the cross-entropy between the high-dimensional and low-dimensional fuzzy simplicial sets:

\[C = \sum_{e \in E} \left[ w_h(e) \log \frac{w_h(e)}{w_l(e)} + (1 - w_h(e)) \log \frac{1 - w_h(e)}{1 - w_l(e)} \right]\]
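This objective can be evaluated directly for a few edge weights. The values below are hypothetical; in UMAP they come from the high- and low-dimensional fuzzy simplicial sets:

```python
import numpy as np

# Hypothetical edge weights (made-up values for illustration).
w_h = np.array([0.9, 0.5, 0.1])   # high-dimensional (target) membership strengths
w_l = np.array([0.8, 0.6, 0.2])   # low-dimensional (candidate) membership strengths

attract = w_h * np.log(w_h / w_l)                  # attractive term: pulls true neighbors together
repel = (1 - w_h) * np.log((1 - w_h) / (1 - w_l))  # repulsive term: pushes non-neighbors apart
C = float(np.sum(attract + repel))
# C is non-negative and reaches 0 only when w_l matches w_h exactly.
```

Each edge contributes the KL divergence between two Bernoulli distributions, which is why the attractive and repulsive forces balance local and global structure during layout optimization.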

t-SNE: KL Divergence Minimization

t-SNE converts pairwise Euclidean distances into joint probabilities \(p_{ij}\) in high-dimensional space (Gaussian kernel) and \(q_{ij}\) in low-dimensional space (Student's t-distribution with one degree of freedom):

\[p_{j|i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2)}\]

\[q_{ij} = \frac{(1 + \lVert y_i - y_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert y_k - y_l \rVert^2)^{-1}}\]

The embedding is found by minimizing the KL divergence:

\[\mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}\]
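These quantities can be computed directly for a handful of points. This sketch fixes a single \(\sigma\) for simplicity, whereas t-SNE actually tunes \(\sigma_i\) per point to match a target perplexity:

```python
import numpy as np

# Toy points: 5 samples in 10-D, and a random candidate 2-D embedding.
rng = np.random.default_rng(2)
x = rng.normal(size=(5, 10))
y = rng.normal(size=(5, 2))
n = len(x)

# High-dimensional conditional probabilities p_{j|i} with a Gaussian kernel.
d2 = np.sum((x[:, None] - x[None, :])**2, axis=-1)
sigma = 1.0
P_cond = np.exp(-d2 / (2 * sigma**2))
np.fill_diagonal(P_cond, 0.0)
P_cond /= P_cond.sum(axis=1, keepdims=True)   # each row p_{.|i} sums to 1
P = (P_cond + P_cond.T) / (2 * n)             # symmetrized joint probabilities

# Low-dimensional joint probabilities q_{ij} with a Student-t kernel (1 dof).
e2 = np.sum((y[:, None] - y[None, :])**2, axis=-1)
Q = 1.0 / (1.0 + e2)
np.fill_diagonal(Q, 0.0)
Q /= Q.sum()

mask = P > 0
kl = float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

Gradient descent on `kl` with respect to `y` is what produces the final embedding; the heavy-tailed Student-t kernel in `Q` is what lets dissimilar points sit far apart, alleviating the crowding problem.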

Algorithm Selection Decision Pseudocode

def select_algorithm(goal, need_out_of_sample_transform=False, dataset_size=0):
    """Return the name of the recommended dimensionality reduction algorithm."""
    if goal in ("compression", "preprocessing"):
        return "PCA"  # fast, invertible, deterministic
    if goal == "visualization":
        if need_out_of_sample_transform:
            return "UMAP"
        if dataset_size > 50_000:
            return "UMAP"  # scales better than t-SNE
        return "t-SNE"  # best cluster separation for visualization
    if goal == "manifold_learning":
        # UMAP is the default; t-SNE also captures nonlinear structure
        # but is transductive and cannot embed new points.
        return "UMAP"
    raise ValueError(f"unknown goal: {goal!r}")

Related Pages

Implemented By
