Implementation:Rapidsai Cuml IncrementalPCA
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Dimensionality_Reduction |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
IncrementalPCA provides a GPU-accelerated implementation of Incremental Principal Component Analysis that performs linear dimensionality reduction using SVD in a memory-efficient, batch-wise manner.
Description
The IncrementalPCA class extends the cuML PCA class to perform incremental principal component analysis (IPCA). Unlike standard PCA, which requires the full dataset to be loaded into memory, IncrementalPCA processes data in mini-batches, making it suitable for datasets too large to fit in GPU memory. It supports both dense and sparse (CSR format) input matrices.
The algorithm centers the input data (but does not scale it) before applying SVD. It maintains running statistics across batches via the partial_fit method, allowing streaming or out-of-core learning. The computational overhead per SVD call is O(batch_size * n_features^2), and only 2 * batch_size samples are held in memory at a time. The implementation is based on sklearn.decomposition.IncrementalPCA from scikit-learn 0.23.1 and uses the incremental PCA model from Ross et al. (2008).
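The running statistics mentioned above can be illustrated with a small pure-Python sketch of merging per-feature mean and variance one batch at a time. The helper name `merge_batch_stats` is hypothetical, and the use of population variance is an assumption. This is not the cuML implementation, only an illustration of the kind of update partial_fit relies on.

```python
def merge_batch_stats(count, mean, var, batch):
    """Fold one batch (list of rows) into running per-feature mean and variance."""
    n_new = len(batch)
    n_features = len(batch[0])
    # Statistics of the incoming batch on its own (population variance, assumed).
    new_mean = [sum(row[j] for row in batch) / n_new for j in range(n_features)]
    new_var = [
        sum((row[j] - new_mean[j]) ** 2 for row in batch) / n_new
        for j in range(n_features)
    ]
    if count == 0:
        return n_new, new_mean, new_var

    total = count + n_new
    merged_mean, merged_var = [], []
    for j in range(n_features):
        delta = new_mean[j] - mean[j]
        merged_mean.append(mean[j] + delta * n_new / total)
        # Sum of squared deviations: old part + new part + between-batch correction.
        m2 = var[j] * count + new_var[j] * n_new + delta ** 2 * count * n_new / total
        merged_var.append(m2 / total)
    return total, merged_mean, merged_var


data = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0], [7.0, 14.0]]
count, mean, var = 0, None, None
for start in range(0, len(data), 2):  # two rows per batch
    count, mean, var = merge_batch_stats(count, mean, var, data[start:start + 2])
print(count, mean, var)  # 4 [4.0, 8.0] [5.0, 20.0]
```

The merged values match what a single pass over all four rows would give, which is why the data never needs to be resident in memory all at once.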
Usage
Use IncrementalPCA when you need to perform PCA on datasets that are too large to fit entirely in GPU memory, when working with streaming data that arrives in batches, or when dealing with sparse input matrices. It is also useful for reducing memory consumption compared to a full PCA while obtaining an approximate result.
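To get a feel for when batch-wise fitting pays off, the following back-of-the-envelope sketch compares the bytes needed to hold a full dense dataset with the roughly 2 * batch_size samples described above. The function names and the example sizes are hypothetical, and the estimate ignores the model's own arrays and any temporary workspace.

```python
def full_dataset_bytes(n_samples, n_features, itemsize=4):
    """Bytes needed to hold the whole dense dataset (float32 by default)."""
    return n_samples * n_features * itemsize


def incremental_working_bytes(batch_size, n_features, itemsize=4):
    """Rough working set if about 2 * batch_size samples are resident at once."""
    return 2 * batch_size * n_features * itemsize


n_samples, n_features = 10_000_000, 512   # hypothetical dataset shape
batch_size = 5 * n_features               # the documented default when batch_size is None

full = full_dataset_bytes(n_samples, n_features)
working = incremental_working_bytes(batch_size, n_features)
print(f"full: {full / 1e9:.2f} GB, incremental working set: {working / 1e6:.2f} MB")
```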
Code Reference
Source Location
- Repository: Rapidsai_Cuml
- File: python/cuml/cuml/decomposition/incremental_pca.py
Signature
class IncrementalPCA(PCA):
    def __init__(
        self,
        *,
        n_components=None,
        whiten=False,
        copy=True,
        batch_size=None,
        verbose=False,
        output_type=None,
    )
Import
from cuml.decomposition import IncrementalPCA
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| n_components | int or None | No | Number of components to keep. If None, set to min(n_samples, n_features). |
| whiten | bool | No | If True, de-correlates components by dividing by singular values and multiplying by sqrt(n_samples). Default is False. |
| copy | bool | No | If False, X will be overwritten to save memory. Default is True. |
| batch_size | int or None | No | Number of samples per batch for fit(). If None, defaults to 5 * n_features. |
| verbose | int or bool | No | Sets logging level. Default is False. |
| output_type | str or None | No | Return results in the indicated output type (e.g., 'cupy', 'numpy', 'cudf'). |
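The effect of batch_size on fit() can be sketched as slicing the sample range into consecutive chunks. The helper `batch_slices` is hypothetical and simplified: the actual implementation may treat a short final batch differently (for example, by merging it into the previous one), which this sketch does not attempt.

```python
def batch_slices(n_samples, n_features, batch_size=None):
    """Return consecutive slices covering range(n_samples), one per mini-batch."""
    if batch_size is None:
        batch_size = 5 * n_features  # documented default
    return [
        slice(start, min(start + batch_size, n_samples))
        for start in range(0, n_samples, batch_size)
    ]


default_batches = batch_slices(1000, 4)                  # batch_size inferred as 20
explicit_batches = batch_slices(1000, 4, batch_size=200)
ragged_batches = batch_slices(10, 4, batch_size=4)       # last batch is shorter
print(len(default_batches), len(explicit_batches), ragged_batches)
```

Each slice would then be passed to partial_fit in turn, so a smaller batch_size lowers peak memory at the cost of more SVD calls.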
Outputs
| Name | Type | Description |
|---|---|---|
| components_ | array (n_components, n_features) | Principal axes in feature space representing maximum variance directions. |
| explained_variance_ | array (n_components,) | Variance explained by each selected component. |
| explained_variance_ratio_ | array (n_components,) | Fraction of the total variance (a value between 0 and 1) explained by each selected component. |
| singular_values_ | array (n_components,) | Singular values corresponding to each selected component. |
| mean_ | array (n_features,) | Per-feature empirical mean, aggregated over calls to partial_fit. |
| var_ | array (n_features,) | Per-feature empirical variance, aggregated over calls to partial_fit. |
| noise_variance_ | float | Estimated noise covariance following the Probabilistic PCA model. |
| n_components_ | int | The estimated number of components. |
| n_samples_seen_ | int | The number of samples processed by the estimator. |
| batch_size_ | int | Batch size actually used by fit(), inferred from the batch_size parameter (5 * n_features when batch_size is None). |
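The relationship between singular_values_, explained_variance_, and explained_variance_ratio_ can be sketched with plain arithmetic. This assumes the scikit-learn convention (variance equals the squared singular value divided by n_samples - 1) carries over to this port. The function names and the numbers are hypothetical.

```python
def explained_variance(singular_values, n_samples):
    """Variance along each component, from its singular value (assumed convention)."""
    return [s ** 2 / (n_samples - 1) for s in singular_values]


def explained_variance_ratio(variances, total_variance):
    """Share of the total variance captured by each kept component."""
    return [v / total_variance for v in variances]


singular_values = [6.0, 3.0]   # hypothetical values for two kept components
n_samples_seen = 10
total_variance = 6.0           # hypothetical; includes variance of discarded components

variances = explained_variance(singular_values, n_samples_seen)
ratios = explained_variance_ratio(variances, total_variance)
print(variances, ratios)
```

Because discarded components also carry variance, the ratios for the kept components generally sum to less than 1.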
Usage Examples
Basic Usage
from cuml.decomposition import IncrementalPCA
import cupy as cp
import cupyx.scipy.sparse  # import the sparse submodule explicitly
# Create a sparse random matrix
X = cupyx.scipy.sparse.random(1000, 4, format='csr', density=0.07, random_state=5)
# Fit IncrementalPCA with 2 components and batch size of 200
ipca = IncrementalPCA(n_components=2, batch_size=200)
ipca.fit(X)
# Access results
print(ipca.components_)
print(ipca.singular_values_)
print(ipca.explained_variance_)
print(ipca.explained_variance_ratio_)
# Transform new data
X_transformed = ipca.transform(X)
Incremental Fitting with partial_fit
from cuml.decomposition import IncrementalPCA
import cupy as cp
ipca = IncrementalPCA(n_components=2)
# Simulate streaming data in batches
for i in range(5):
    X_batch = cp.random.rand(100, 10).astype(cp.float32)
    ipca.partial_fit(X_batch)
# Transform data after incremental fitting
X_new = cp.random.rand(50, 10).astype(cp.float32)
X_transformed = ipca.transform(X_new)