Overview
Concrete tool for configuring PCA, UMAP, and t-SNE dimensionality reduction algorithms provided by the cuML library.
Description
These constructors initialize the three primary dimensionality reduction estimators in cuML. Each constructor accepts algorithm-specific hyperparameters that control the behavior of the reduction:
- PCA.__init__ configures the number of components, SVD solver strategy (full eigendecomposition vs. iterative Jacobi), whitening, and convergence tolerance.
- UMAP.__init__ configures the neighborhood graph construction (n_neighbors, metric, build algorithm), embedding optimization (n_epochs, learning_rate, min_dist, spread), and initialization strategy.
- TSNE.__init__ configures the probability distribution parameters (perplexity, early/late exaggeration), the gradient computation method (exact, or the Barnes-Hut or FFT approximation), and gradient descent parameters (learning_rate, momentum).
Usage
Import and instantiate these classes when setting up a dimensionality reduction pipeline. Choose the class based on whether you need linear (PCA) or nonlinear (UMAP, t-SNE) reduction, then configure the hyperparameters for your specific dataset and goals.
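The selection step described above can be sketched as a small lookup keyed by estimator name. `REDUCERS` and `build_reducer` are hypothetical names introduced for illustration; only the import paths and the keyword-only constructors come from this page. The import is deferred so the helper can be loaded on a machine without cuML.

```python
from importlib import import_module

# Hypothetical registry: short name -> (module path, class name).
# The module paths are the ones listed under "Import" on this page.
REDUCERS = {
    "pca": ("cuml.decomposition", "PCA"),   # linear reduction
    "umap": ("cuml.manifold", "UMAP"),      # nonlinear reduction
    "tsne": ("cuml.manifold", "TSNE"),      # nonlinear reduction
}


def build_reducer(name, **hyperparams):
    """Instantiate the named estimator with keyword-only hyperparameters."""
    try:
        module_path, class_name = REDUCERS[name.lower()]
    except KeyError:
        raise ValueError(
            f"unknown reducer {name!r}; expected one of {sorted(REDUCERS)}"
        ) from None
    # Deferred import: cuML (and a GPU) is only needed from this point on.
    estimator_cls = getattr(import_module(module_path), class_name)
    return estimator_cls(**hyperparams)
```

Where cuML is installed, `build_reducer("umap", n_neighbors=15, min_dist=0.1)` should return a configured UMAP estimator; an unrecognized name raises `ValueError` before any import is attempted.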
Code Reference
PCA.__init__
Source Location
- Repository: cuML
- File: python/cuml/cuml/decomposition/pca.pyx
- Lines: 323-341
Signature
def __init__(
self,
*,
copy=True,
iterated_power=15,
n_components=None,
svd_solver='auto',
tol=1e-7,
verbose=False,
whiten=False,
output_type=None,
):
Import
from cuml import PCA
# or
from cuml.decomposition import PCA
UMAP.__init__
Source Location
- Repository: cuML
- File: python/cuml/cuml/manifold/umap/umap.pyx
- Lines: 1052-1111
Signature
def __init__(
self,
*,
n_neighbors=15,
n_components=2,
metric="euclidean",
metric_kwds=None,
n_epochs=None,
learning_rate=1.0,
min_dist=0.1,
spread=1.0,
set_op_mix_ratio=1.0,
local_connectivity=1.0,
repulsion_strength=1.0,
negative_sample_rate=5,
transform_queue_size=4.0,
init="spectral",
a=None,
b=None,
target_n_neighbors=-1,
target_weight=0.5,
target_metric="categorical",
hash_input=False,
random_state=None,
precomputed_knn=None,
callback=None,
build_algo="auto",
build_kwds=None,
device_ids=None,
verbose=False,
output_type=None,
):
Import
from cuml import UMAP
# or
from cuml.manifold import UMAP
TSNE.__init__
Source Location
- Repository: cuML
- File: python/cuml/cuml/manifold/t_sne.pyx
- Lines: 507-557
Signature
def __init__(
self,
*,
n_components=2,
perplexity=30.0,
early_exaggeration=12.0,
late_exaggeration=1.0,
learning_rate=200.0,
max_iter=1000,
n_iter_without_progress=300,
min_grad_norm=1e-07,
metric='euclidean',
metric_params=None,
init='random',
random_state=None,
method='fft',
angle=0.5,
n_neighbors=90,
perplexity_max_iter=100,
exaggeration_iter=250,
pre_momentum=0.5,
post_momentum=0.8,
learning_rate_method='adaptive',
square_distances=True,
precomputed_knn=None,
verbose=False,
output_type=None,
):
Import
from cuml import TSNE
# or
from cuml.manifold import TSNE
I/O Contract
PCA Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| copy | bool | No (default True) | If True, copies data then removes mean. False may overwrite input with mean-centered version. |
| iterated_power | int | No (default 15) | Number of iterations for the Jacobi solver. More iterations yield higher accuracy at slower speed. |
| n_components | int or None | No (default None) | Number of top K singular vectors to keep. If None, keeps min(n_samples, n_features). |
| svd_solver | str | No (default 'auto') | One of 'full', 'jacobi', or 'auto'. 'full' uses eigendecomposition; 'jacobi' is iterative and faster but less accurate. |
| tol | float | No (default 1e-7) | Convergence tolerance for Jacobi solver. Smaller values increase accuracy but slow convergence. |
| verbose | int or bool | No (default False) | Sets logging level. |
| whiten | bool | No (default False) | If True, divides components by singular values and multiplies by sqrt(n_samples) for unit variance. |
| output_type | str or None | No (default None) | Output data type format ('array', 'dataframe', 'cupy', 'numpy', etc.). |
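The `n_components` default and the allowed `svd_solver` values listed for PCA can be checked before the estimator is built. The following is a minimal sketch, not part of cuML: `resolve_pca_kwargs` and `VALID_SVD_SOLVERS` are hypothetical names, and rejecting `n_components` above min(n_samples, n_features) is an assumption rather than documented behavior.

```python
VALID_SVD_SOLVERS = ("auto", "full", "jacobi")


def resolve_pca_kwargs(n_samples, n_features, n_components=None,
                       svd_solver="auto", **extra):
    """Return PCA constructor kwargs with n_components made explicit."""
    if svd_solver not in VALID_SVD_SOLVERS:
        raise ValueError(f"svd_solver must be one of {VALID_SVD_SOLVERS}")
    max_components = min(n_samples, n_features)
    if n_components is None:
        # Documented default: keep min(n_samples, n_features) components.
        n_components = max_components
    elif not 1 <= n_components <= max_components:
        # Assumption: treat values outside this range as a caller error.
        raise ValueError(
            f"n_components must be between 1 and {max_components}"
        )
    return {"n_components": n_components, "svd_solver": svd_solver, **extra}
```

The returned dictionary can be passed straight to the constructor, for example `PCA(**resolve_pca_kwargs(1000, 64, whiten=True))`.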
UMAP Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| n_neighbors | float | No (default 15) | Size of local neighborhood for manifold approximation. Values generally fall in the range 2-100. |
| n_components | int | No (default 2) | Dimension of the target embedding space. |
| metric | str | No (default 'euclidean') | Distance metric. Supports 'euclidean', 'manhattan', 'cosine', 'correlation', 'chebyshev', 'minkowski', 'hamming', 'jaccard', and others. |
| metric_kwds | dict or None | No (default None) | Arguments for parameterized metrics (e.g., Minkowski p). |
| n_epochs | int or None | No (default None) | Number of training epochs. None selects automatically (200 for large, 500 for small datasets). |
| learning_rate | float | No (default 1.0) | Initial learning rate for embedding optimization. |
| min_dist | float | No (default 0.1) | Minimum distance between embedded points. Smaller values produce tighter clusters. |
| spread | float | No (default 1.0) | Effective scale of embedded points. |
| init | str or array-like | No (default 'spectral') | Initialization method: 'spectral', 'random', or an array-like of initial positions. |
| build_algo | str | No (default 'auto') | KNN build algorithm: 'auto', 'brute_force_knn', or 'nn_descent'. |
| random_state | int or None | No (default None) | Seed for reproducible embeddings. |
| hash_input | bool | No (default False) | Hash training input to return exact embeddings on transform of same data. |
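The automatic `n_epochs` selection listed for UMAP can be mirrored in plain Python when the epoch count needs to be known up front, for example to estimate run time. This is a sketch under a stated assumption: this page does not say where "small" ends and "large" begins, so `LARGE_DATASET_ROWS` is a placeholder cutoff, and both helper names are hypothetical.

```python
# Assumption: the small/large cutoff is not given on this page.
# 10_000 rows is a placeholder; adjust it to match your cuML version.
LARGE_DATASET_ROWS = 10_000


def resolve_n_epochs(n_epochs, n_samples, large_cutoff=LARGE_DATASET_ROWS):
    """Return the explicit epoch count, or the size-based default."""
    if n_epochs is not None:
        return n_epochs
    # Documented defaults: 200 epochs for large data, 500 for small data.
    return 200 if n_samples > large_cutoff else 500


def n_neighbors_in_usual_range(n_neighbors):
    """Return True when n_neighbors lies in the 2-100 range."""
    return 2 <= n_neighbors <= 100
```

An explicit `n_epochs` always wins; the size-based branch only applies when the argument is left as None.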
TSNE Inputs
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| n_components | int | No (default 2) | Output dimensionality. Currently only 2 is supported. |
| perplexity | float | No (default 30.0) | Related to number of nearest neighbors. Larger datasets usually need larger values. Typical values range from 5 to 50. |
| early_exaggeration | float | No (default 12.0) | Controls space between clusters during early optimization. |
| late_exaggeration | float | No (default 1.0) | Controls cluster separation after exaggeration_iter iterations (FFT only). |
| learning_rate | float | No (default 200.0) | Learning rate, typically between 10 and 1000. |
| max_iter | int | No (default 1000) | Maximum number of optimization iterations. |
| method | str | No (default 'fft') | Algorithm: 'fft' (fast), 'barnes_hut' (fast approximation), or 'exact' (accurate but slow). |
| angle | float | No (default 0.5) | Speed/accuracy trade-off for Barnes-Hut. Range 0.0-1.0. |
| metric | str | No (default 'euclidean') | Distance metric. Supports 'euclidean', 'manhattan', 'cosine', 'correlation', 'chebyshev', 'minkowski', 'sqeuclidean'. |
| init | str | No (default 'random') | Initialization: 'random' or 'pca'. |
| random_state | int or None | No (default None) | Seed for initialization. Note: results are not fully deterministic. |
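The t-SNE constraints listed above (two output dimensions, three methods, the `angle` range, the usual perplexity range) can be checked before construction. The sketch below is illustrative only: `tsne_kwargs` and `VALID_METHODS` are hypothetical names, and scaling `n_neighbors` to three times the perplexity is an assumption inferred from the defaults in the signature (perplexity 30.0, n_neighbors 90), not a documented rule.

```python
import math
import warnings

VALID_METHODS = ("fft", "barnes_hut", "exact")


def tsne_kwargs(perplexity=30.0, method="fft", n_components=2, angle=0.5,
                n_neighbors=None):
    """Validate common TSNE settings and return them as constructor kwargs."""
    if n_components != 2:
        raise ValueError("only n_components=2 is currently supported")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}")
    if not 0.0 <= angle <= 1.0:
        raise ValueError("angle must lie in [0.0, 1.0]")
    if not 5.0 <= perplexity <= 50.0:
        # Outside the usual range: allowed, but worth flagging.
        warnings.warn(f"perplexity={perplexity} is outside the usual 5-50 range")
    if n_neighbors is None:
        # Assumption: keep the 3:1 ratio seen in the defaults (90 vs. 30.0).
        n_neighbors = math.ceil(3 * perplexity)
    return {
        "n_components": n_components,
        "perplexity": perplexity,
        "method": method,
        "angle": angle,
        "n_neighbors": n_neighbors,
    }
```

With cuML available, the result can be unpacked into the constructor as `TSNE(**tsne_kwargs(perplexity=50.0))`.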
Outputs
| Name | Type | Description |
| --- | --- | --- |
| PCA instance | PCA | Configured PCA estimator ready for fitting. |
| UMAP instance | UMAP | Configured UMAP estimator ready for fitting. |
| TSNE instance | TSNE | Configured TSNE estimator ready for fitting. |
Usage Examples
PCA Configuration
from cuml.decomposition import PCA
# Basic PCA with 3 components
pca = PCA(n_components=3)
# PCA with Jacobi solver for faster computation
pca_fast = PCA(n_components=50, svd_solver='jacobi', iterated_power=20, tol=1e-5)
# PCA with whitening for downstream linear models
pca_white = PCA(n_components=10, whiten=True)
UMAP Configuration
from cuml.manifold import UMAP
# Basic 2D visualization
umap = UMAP(n_components=2, n_neighbors=15, min_dist=0.1)
# Tighter clusters with more neighbors
umap_tight = UMAP(n_neighbors=50, min_dist=0.01, spread=1.0, n_epochs=500)
# Reproducible embedding with NN Descent for large data
umap_repro = UMAP(random_state=42, build_algo='nn_descent')
TSNE Configuration
from cuml.manifold import TSNE
# Basic 2D t-SNE with FFT approximation
tsne = TSNE(n_components=2, method='fft')
# Higher perplexity for larger datasets
tsne_large = TSNE(perplexity=50.0, learning_rate=500.0, max_iter=2000)
# Exact algorithm for small datasets
tsne_exact = TSNE(method='exact', perplexity=15.0, random_state=42)