Workflow:Rapidsai Cuml Dimensionality Reduction

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Dimensionality_Reduction, Data_Visualization, GPU_Computing
Last Updated: 2026-02-08 12:00 GMT

Overview

End-to-end process for GPU-accelerated dimensionality reduction using cuML's PCA, UMAP, and t-SNE implementations for data visualization, feature compression, and manifold learning.

Description

This workflow covers the standard procedure for reducing high-dimensional data to lower dimensions using NVIDIA GPU acceleration. It supports three complementary approaches: PCA for linear dimensionality reduction preserving global variance structure, UMAP for non-linear manifold learning preserving both local and global structure, and t-SNE for high-quality 2D visualization emphasizing local neighborhood relationships. All three algorithms follow the scikit-learn estimator API (`fit()` and `fit_transform()`, plus `transform()` for PCA and UMAP) and accept cuDF DataFrames, CuPy arrays, NumPy arrays, and sparse matrices as input.

Usage

Execute this workflow when you have high-dimensional data and need to reduce it for visualization, feature extraction, or as a preprocessing step for downstream ML models. Use PCA when linear relationships dominate and you need fast, deterministic reduction with inverse transform capability. Use UMAP when you need to preserve manifold structure and want to transform new data points. Use t-SNE when creating publication-quality 2D visualizations where local neighborhood fidelity is paramount.

Execution Steps

Step 1: Data Preparation

Load the dataset into GPU memory. Ensure all features are numeric. Apply feature scaling (StandardScaler or similar), which is important for distance-based methods such as UMAP and t-SNE. PCA centers the data automatically but does not rescale it, so pre-scaling keeps features with large numeric ranges from dominating the principal components.

Key considerations:

  • All three algorithms accept dense arrays and sparse matrices
  • Feature scaling significantly impacts UMAP and t-SNE embedding quality
  • For very large datasets, consider subsampling for t-SNE or using UMAP's approximate KNN graph construction (`build_algo='nn_descent'`)
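
The scaling step can be sketched with plain array arithmetic. This is a minimal illustration rather than cuML's own scaler: the helper name `standardize` is hypothetical, the data is synthetic, and the demo uses NumPy so that it runs without a GPU. Because the helper relies only on `mean`, `std`, and broadcasting, it should also work on a CuPy array, although that is an assumption and not something the source states.

```python
import numpy as np

def standardize(X, eps=1e-12):
    """Scale each column to zero mean and unit variance (z-score)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Constant columns have std == 0; adding the boolean mask turns
    # those entries into 1.0 so the division below is safe.
    std = std + (std < eps)
    return (X - mean) / std

# Synthetic features on very different scales (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 200.0, 0.01], size=(1000, 3))
X_scaled = standardize(X)
```

Without this step, the second column would dominate any Euclidean distance computed by UMAP or t-SNE simply because of its numeric range.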

Step 2: Algorithm Selection

Choose the dimensionality reduction algorithm based on the analysis goal and data characteristics.

PCA: Best for linear structure preservation, data compression, and noise reduction. Produces deterministic results. Supports `inverse_transform()` for reconstruction. Choose `svd_solver='full'` for accuracy or `'jacobi'` for speed.

UMAP: Best for manifold learning and non-linear structure visualization. Supports `transform()` for new data. Key parameters are `n_neighbors` (local structure granularity), `min_dist` (embedding point spacing), and `metric` (distance function).

t-SNE: Best for 2D visualization with strong local structure preservation. Does not support `transform()` on new data. Choose `method='fft'` (fastest), `'barnes_hut'` (classic), or `'exact'` (most accurate). Only supports `n_components=2`.
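
The selection rules above can be written down as a small decision helper. This is only a sketch: `choose_reducer` and its goal labels are hypothetical, and the keyword values it returns are illustrative starting points, not recommendations from the source. It returns names and constructor arguments instead of building estimators, so it runs without cuML installed.

```python
def choose_reducer(goal, need_transform=False, need_inverse=False):
    """Return (algorithm name, suggested constructor kwargs).

    goal: one of "compression", "manifold", "visualization" (hypothetical labels).
    """
    # PCA is the option described as supporting reconstruction and compression.
    if need_inverse or goal == "compression":
        return "PCA", {"svd_solver": "full"}
    # t-SNE cannot embed new points, so only pick it when that is not needed.
    if goal == "visualization" and not need_transform:
        return "TSNE", {"n_components": 2, "method": "fft"}
    if goal in ("manifold", "visualization"):
        return "UMAP", {"n_neighbors": 15, "min_dist": 0.1, "metric": "euclidean"}
    raise ValueError(f"unknown goal: {goal!r}")

name, kwargs = choose_reducer("visualization", need_transform=True)
```

The returned dictionary can be unpacked into the matching estimator constructor, for example `UMAP(**kwargs)`.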

Step 3: Model Fitting

Call `fit()` or `fit_transform()` on the chosen estimator with the prepared data. The low-dimensional embedding is computed on the GPU. For PCA, this involves eigendecomposition of the covariance matrix. For UMAP, this builds a KNN graph, constructs a fuzzy simplicial set, and optimizes the embedding layout. For t-SNE, this computes pairwise similarities in the input space and optimizes the embedding by gradient descent on the KL divergence between the high- and low-dimensional similarity distributions.
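
To make the PCA case concrete, the eigendecomposition route can be sketched in NumPy. This illustrates the math only and is not cuML's implementation; `pca_fit` is a hypothetical helper and the data is synthetic.

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA by eigendecomposition of the sample covariance matrix."""
    mean = X.mean(axis=0)
    Xc = X - mean                                 # center the data
    cov = (Xc.T @ Xc) / (X.shape[0] - 1)          # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T              # rows are principal axes
    explained_variance = eigvals[order]
    ratio = explained_variance / eigvals.sum()
    return mean, components, explained_variance, ratio

# Correlated synthetic data: 5 observed features mixed from random factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
mean, components, explained_variance, ratio = pca_fit(X, n_components=2)
embedding = (X - mean) @ components.T             # shape (500, 2)
```

The sign of each principal axis is arbitrary, so an embedding computed this way may be mirrored relative to another implementation's output.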

Key considerations:

  • PCA is deterministic; UMAP and t-SNE have stochastic elements (use `random_state` for reproducibility)
  • UMAP's `build_algo` parameter controls KNN graph construction: 'brute_force_knn' for small datasets, 'nn_descent' for large ones, 'auto' for automatic selection
  • t-SNE's `perplexity` parameter sets the effective number of neighbors each point considers: lower values emphasize local structure, higher values bring in more global structure
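
A fitting call for all three estimators might look like the sketch below. The parameter names come from this workflow; the specific values, the `fit_all` helper, and the dictionary names are assumptions made for illustration. The cuML imports sit inside the function so the module still loads on a machine without cuML, and calling `fit_all` itself requires a CUDA-capable GPU.

```python
# Constructor arguments named in this workflow; the values are illustrative.
PCA_PARAMS = {"n_components": 2, "svd_solver": "full"}
UMAP_PARAMS = {
    "n_components": 2,
    "n_neighbors": 15,
    "min_dist": 0.1,
    "build_algo": "auto",     # or "brute_force_knn" / "nn_descent"
    "random_state": 42,       # fixed seed for reproducibility
}
TSNE_PARAMS = {
    "n_components": 2,        # the only supported value
    "perplexity": 30.0,
    "method": "fft",          # or "barnes_hut" / "exact"
    "random_state": 42,
}

def fit_all(X):
    """Fit PCA, UMAP, and t-SNE on X and return the three embeddings."""
    # Imported lazily so this module can be loaded without cuML installed.
    from cuml.decomposition import PCA
    from cuml.manifold import TSNE, UMAP

    return {
        "pca": PCA(**PCA_PARAMS).fit_transform(X),
        "umap": UMAP(**UMAP_PARAMS).fit_transform(X),
        "tsne": TSNE(**TSNE_PARAMS).fit_transform(X),
    }
```

Keeping the arguments in plain dictionaries makes it easy to log the exact configuration alongside each embedding.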

Step 4: Embedding Extraction

Retrieve the low-dimensional representation from the model. For UMAP and t-SNE, the embedding is available via the `embedding_` attribute or as the return value of `fit_transform()`. PCA does not expose an `embedding_` attribute; use the return value of `fit_transform()` or `transform()` instead. For PCA, also examine `explained_variance_ratio_` to assess how much variance is captured.

Attributes available after fitting:

  • PCA: `components_` contains principal axes, `explained_variance_ratio_` shows variance captured per component
  • UMAP: `embedding_` contains the manifold embedding, `graph_` contains the fuzzy simplicial set
  • t-SNE: `embedding_` contains the 2D visualization embedding, `kl_divergence_` measures optimization quality
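
One common use of `explained_variance_ratio_` is to pick how many components to keep. The sketch below assumes the ratios have already been copied to the host as a plain sequence (a device array would need converting first); `components_for_variance` is a hypothetical helper and the ratios are made-up numbers.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    ratio = np.asarray(explained_variance_ratio, dtype=float)
    cumulative = np.cumsum(ratio)
    # First index where the running total is >= threshold, as a count.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # If the threshold is never reached, keep every component.
    return min(k, ratio.size)

# Hypothetical ratios, as a fitted PCA model might report them.
ratios = [0.60, 0.25, 0.10, 0.05]
k = components_for_variance(ratios, threshold=0.9)
```

Here the first two components capture 85% of the variance, so three are needed to pass the 90% threshold.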

Step 5: Transformation of New Data

For PCA and UMAP, transform new data points into the learned embedding space using `transform()`. This enables applying the dimensionality reduction to held-out test data or streaming data. t-SNE does not support transformation of new points; refit the model on the combined dataset instead.

Key considerations:

  • PCA's `transform()` projects data onto learned principal components
  • UMAP's `transform()` finds positions in the learned manifold for new points
  • PCA supports `inverse_transform()` for reconstructing approximate original data
  • UMAP optionally supports `inverse_transform()` for spatial reconstruction
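
For PCA, the projection and reconstruction described above reduce to two matrix products. The NumPy sketch below shows the arithmetic on synthetic data; it obtains the principal axes from an SVD of the centered training data instead of a covariance eigendecomposition, and assumes no whitening. The function names mirror the estimator methods but are local, hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))   # data the model is fitted on
X_new = rng.normal(size=(10, 5))      # held-out points to embed later

mean = X_train.mean(axis=0)
# Rows of Vt are the principal axes, ordered by decreasing singular value.
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)

def transform(X, components):
    """Project X onto the learned principal axes."""
    return (X - mean) @ components.T

def inverse_transform(Z, components):
    """Map embedded points back to the original feature space."""
    return Z @ components + mean

Z2 = transform(X_new, Vt[:2])                            # 2-D embedding of new points
X_approx = inverse_transform(Z2, Vt[:2])                 # lossy reconstruction
X_exact = inverse_transform(transform(X_new, Vt), Vt)    # all 5 axes: lossless
```

Keeping only two of five axes discards variance, so `X_approx` differs from `X_new`, while using every axis recovers the input up to floating-point error.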
