Implementation:Rapidsai Cuml IncrementalPCA

Knowledge Sources	Rapidsai_Cuml
Domains	Machine_Learning, Dimensionality_Reduction
Last Updated	2026-02-08 12:00 GMT

Overview

IncrementalPCA provides a GPU-accelerated implementation of Incremental Principal Component Analysis that performs linear dimensionality reduction using SVD in a memory-efficient, batch-wise manner.

Description

The IncrementalPCA class extends the cuML PCA class to perform incremental principal component analysis (IPCA). Unlike standard PCA, which requires the full dataset in memory at once, IncrementalPCA processes data in mini-batches, which makes it suitable for datasets too large to fit in GPU memory. It accepts both dense arrays and sparse matrices in CSR format.

The algorithm centers the input data (but does not scale it) before applying SVD. It maintains running statistics across batches via the partial_fit method, allowing streaming or out-of-core learning. The computational overhead per SVD call is O(batch_size * n_features^2), and only 2 * batch_size samples are held in memory at a time. The implementation is based on sklearn.decomposition.IncrementalPCA from scikit-learn 0.23.1 and uses the incremental PCA model from Ross et al. (2008).
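The batch update described above can be sketched in plain NumPy. This is a minimal illustration in the style of the Ross et al. (2008) update as scikit-learn applies it, not cuML's actual code: the function name `ipca_update` and its state tuple are hypothetical, and cuML performs the equivalent work on the GPU.

```python
import numpy as np

def ipca_update(state, X_batch, n_components):
    """Fold one mini-batch into (n_seen, mean, singular_values, components).

    Hypothetical helper for illustration only; not part of cuML.
    """
    n_seen, mean, S, Vt = state
    n_batch = X_batch.shape[0]
    n_total = n_seen + n_batch
    batch_mean = X_batch.mean(axis=0)
    centered = X_batch - batch_mean  # centered, but not scaled

    if n_seen == 0:
        # First batch: nothing to merge with yet.
        stacked = centered
        new_mean = batch_mean
    else:
        # Running mean over everything seen so far.
        new_mean = (n_seen * mean + n_batch * batch_mean) / n_total
        # Extra row that accounts for the shift between old and batch means.
        correction = np.sqrt(n_seen * n_batch / n_total) * (mean - batch_mean)
        # Previous model (scaled components) + new batch + mean correction.
        stacked = np.vstack([S[:, None] * Vt, centered, correction])

    _, S_new, Vt_new = np.linalg.svd(stacked, full_matrices=False)
    return n_total, new_mean, S_new[:n_components], Vt_new[:n_components]


rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))

state = (0, None, None, None)
for start in range(0, X.shape[0], 50):
    state = ipca_update(state, X[start:start + 50], n_components=4)

n_seen, mean, S, Vt = state
# Reference: singular values from one SVD over the fully centered data.
S_full = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
```

In this sketch each SVD sees at most `n_components + batch_size + 1` rows, which is why memory use depends on the batch size rather than the dataset size. When all components are kept, as here, the incremental singular values match the full-data ones; truncating to fewer components makes the result approximate.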

Usage

Use IncrementalPCA when you need to perform PCA on datasets that are too large to fit entirely in GPU memory, when working with streaming data that arrives in batches, or when dealing with sparse input matrices. It is also useful for reducing memory consumption compared to a full PCA while obtaining an approximate result.

Code Reference

Source Location

Repository: Rapidsai_Cuml
File: python/cuml/cuml/decomposition/incremental_pca.py

Signature

class IncrementalPCA(PCA):
    def __init__(
        self,
        *,
        n_components=None,
        whiten=False,
        copy=True,
        batch_size=None,
        verbose=False,
        output_type=None,
    )

Import

from cuml.decomposition import IncrementalPCA

I/O Contract

Inputs

Name	Type	Required	Description
n_components	int or None	No	Number of components to keep. If None, set to min(n_samples, n_features).
whiten	bool	No	If True, de-correlates components by dividing by singular values and multiplying by sqrt(n_samples). Default is False.
copy	bool	No	If False, X will be overwritten to save memory. Default is True.
batch_size	int or None	No	Number of samples per batch for fit(). If None, defaults to 5 * n_features.
verbose	int or bool	No	Sets logging level. Default is False.
output_type	str or None	No	Return results in the indicated output type (e.g., 'cupy', 'numpy', 'cudf').
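
The batch_size default can be illustrated with a small sketch of how a fit() call might walk over the rows. This is an assumption-laden illustration: `batch_slices` is a hypothetical helper, and the library may treat a short final batch differently.

```python
def batch_slices(n_samples, n_features, batch_size=None):
    """Yield row slices covering range(n_samples) in chunks of batch_size.

    Hypothetical helper; mirrors the documented default of 5 * n_features.
    """
    if batch_size is None:
        batch_size = 5 * n_features
    for start in range(0, n_samples, batch_size):
        yield slice(start, min(start + batch_size, n_samples))


# 1000 samples with 4 features: default batch size is 5 * 4 = 20 rows.
default_slices = list(batch_slices(n_samples=1000, n_features=4))

# An explicit batch size of 200 gives 5 batches.
explicit_slices = list(batch_slices(n_samples=1000, n_features=4, batch_size=200))
```

Each slice would be passed to partial_fit in turn, so a larger batch_size means fewer but larger SVD calls.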

Outputs

Name	Type	Description
components_	array (n_components, n_features)	Principal axes in feature space, representing the directions of maximum variance.
explained_variance_	array (n_components,)	Variance explained by each selected component.
explained_variance_ratio_	array (n_components,)	Fraction of the total variance (between 0 and 1) explained by each selected component.
singular_values_	array (n_components,)	Singular values corresponding to each selected component.
mean_	array (n_features,)	Per-feature empirical mean, aggregated over calls to partial_fit.
var_	array (n_features,)	Per-feature empirical variance, aggregated over calls to partial_fit.
noise_variance_	float	Estimated noise covariance following the Probabilistic PCA model.
n_components_	int	The estimated number of components.
n_samples_seen_	int	The number of samples processed by the estimator.
batch_size_	int	Batch size actually used by fit(), inferred from the batch_size parameter.
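
How the variance-related attributes relate to one another can be sketched with NumPy on a small dense array. This assumes the scikit-learn convention of dividing by `n_samples - 1`; the variable names are illustrative and the exact denominators used by cuML are not confirmed here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
n_samples = X.shape[0]
n_components = 2

# Center the data (no scaling), then take the SVD.
Xc = X - X.mean(axis=0)
_, S, Vt = np.linalg.svd(Xc, full_matrices=False)

components = Vt[:n_components]            # analogous to components_
singular_values = S[:n_components]        # analogous to singular_values_

# Variance along each kept component (assumed n_samples - 1 denominator).
explained_variance = singular_values**2 / (n_samples - 1)

# Total variance is the sum of per-feature sample variances.
total_variance = Xc.var(axis=0, ddof=1).sum()
explained_variance_ratio = explained_variance / total_variance
```

The ratios sum to less than 1 whenever fewer components are kept than there are features; the remainder is the variance in the discarded directions.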

Usage Examples

Basic Usage

from cuml.decomposition import IncrementalPCA
import cupyx.scipy.sparse  # import the submodule explicitly so cupyx.scipy.sparse.random resolves

# Create a sparse random matrix
X = cupyx.scipy.sparse.random(1000, 4, format='csr', density=0.07, random_state=5)

# Fit IncrementalPCA with 2 components and batch size of 200
ipca = IncrementalPCA(n_components=2, batch_size=200)
ipca.fit(X)

# Access results
print(ipca.components_)
print(ipca.singular_values_)
print(ipca.explained_variance_)
print(ipca.explained_variance_ratio_)

# Project the data onto the two fitted principal components
X_transformed = ipca.transform(X)

Incremental Fitting with partial_fit

from cuml.decomposition import IncrementalPCA
import cupy as cp

ipca = IncrementalPCA(n_components=2)

# Simulate streaming data in batches
for i in range(5):
    X_batch = cp.random.rand(100, 10).astype(cp.float32)
    ipca.partial_fit(X_batch)

# Transform data after incremental fitting
X_new = cp.random.rand(50, 10).astype(cp.float32)
X_transformed = ipca.transform(X_new)
