Principle:Rapidsai Cuml Random Projection

Knowledge Sources	Johnson & Lindenstrauss 1984 - Extensions of Lipschitz Mappings into a Hilbert Space; Achlioptas 2003 - Database-friendly Random Projections; Li et al. 2006 - Very Sparse Random Projections
Domains	Machine_Learning, Dimensionality_Reduction, Linear_Algebra
Last Updated	2026-02-08 12:00 GMT

Overview

Random projection reduces the dimensionality of data by projecting it onto a lower-dimensional subspace using a random matrix, with theoretical guarantees from the Johnson-Lindenstrauss lemma that pairwise distances are approximately preserved.

Description

Random projection is a computationally efficient technique for dimensionality reduction that trades a small, controllable amount of accuracy for significant gains in speed and memory. Unlike methods such as PCA that compute data-dependent projections via eigendecomposition, random projection constructs its projection matrix independently of the data values: only the input dimension and the desired output dimension are needed.

The theoretical foundation is the Johnson-Lindenstrauss (JL) lemma, which states that any set of $n$ points in high-dimensional Euclidean space can be embedded into a space of dimension $k = O (ϵ^{- 2} \log n)$ such that all pairwise squared distances are preserved within a factor of $(1 \pm ϵ)$ . Notably, $k$ depends on the number of points and the tolerance but not on the original dimension. This provides a principled way to choose the target dimension: given the number of samples $n$ and the acceptable distortion $ϵ$ , the minimum safe number of components can be computed directly.

Two types of random projection matrices are commonly used:

Gaussian Random Projection: Each element of the projection matrix is drawn independently from a Gaussian distribution $N (0, 1 / k)$ where $k$ is the target dimensionality. This satisfies the JL lemma and produces a dense projection matrix. The projection is simply a matrix multiplication of the input data with the random matrix.
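
A short calculation, included here as an illustrative aside, shows why the variance $1 / k$ is the natural choice. For a fixed row vector $x \in ℝ^{d}$ and a matrix $R \in ℝ^{d \times k}$ whose entries are independent with mean zero and variance $1 / k$ :

$\mathbb{E} [ ‖ x R ‖^{2} ] = \sum_{j = 1}^{k} \mathbb{E} [ ( \sum_{i = 1}^{d} x_{i} R_{i j} )^{2} ] = \sum_{j = 1}^{k} \sum_{i = 1}^{d} x_{i}^{2} \cdot \frac{1}{k} = ‖ x ‖^{2}$

The cross terms vanish because the entries are independent and have mean zero. Applied to a difference $x = u - v$ , this says that squared distances are preserved in expectation; the JL lemma then bounds how far any single pair can deviate from that expectation. The argument uses only the mean and variance of the entries, not the Gaussian shape.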

Sparse Random Projection: Following the work of Achlioptas and Li et al., the projection matrix is constructed so that most entries are zero. Each entry is independently set to $+ \sqrt{s / k}$ , $0$ , or $- \sqrt{s / k}$ with probabilities $1 / (2 s)$ , $1 - 1 / s$ , and $1 / (2 s)$ respectively, where $s \geq 1$ is a sparsity parameter and $k$ is the target dimensionality. With this scaling every entry has mean zero and variance $1 / k$ , matching the Gaussian case. Achlioptas used $s = 3$ ; Li et al. proposed the much sparser choice $s = \sqrt{d}$ for input dimension $d$ . Because only a fraction $1 / s$ of the entries is nonzero on average, the projection needs roughly $1 / s$ of the multiplications required by the dense Gaussian variant, which matters most for high-dimensional inputs.

Usage

Random projection is the right choice when:

The input data is very high-dimensional and a fast, approximate dimensionality reduction is needed as a preprocessing step.
Exact preservation of distances is not required, but approximate preservation (within a tolerance epsilon) suffices.
The downstream task relies primarily on pairwise distances (e.g., nearest neighbor search, clustering).
PCA is too expensive to compute (a full eigendecomposition of the $d \times d$ covariance matrix costs $O (d^{3})$ for $d$ features) or unnecessary because the data does not have strong low-rank structure.
Sparse random projection is preferred when the input dimension is very large and compute time is critical.

Theoretical Basis

Johnson-Lindenstrauss Lemma:

For any $0 < ϵ < 1$ and any set $S$ of $n$ points in $ℝ^{d}$ , there exists a map $f : ℝ^{d} \to ℝ^{k}$ with $k = O (ϵ^{- 2} \log n)$ such that for all $u, v \in S$ :

$(1 - ϵ) ‖ u - v ‖^{2} \leq ‖ f (u) - f (v) ‖^{2} \leq (1 + ϵ) ‖ u - v ‖^{2}$

Minimum Safe Dimension:

$k \geq \frac{4 \ln n}{ϵ^{2} / 2 - ϵ^{3} / 3}$
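
The bound above is easy to evaluate directly. The following is a minimal sketch using only the Python standard library; the function name `jl_min_dim` is hypothetical and chosen for illustration, and rounding up to the next integer is an assumption of this sketch (a library routine may round differently).

```python
import math


def jl_min_dim(n_samples: int, eps: float) -> int:
    """Smallest integer k with k >= 4 ln(n) / (eps^2 / 2 - eps^3 / 3)."""
    if n_samples < 1:
        raise ValueError("n_samples must be a positive integer")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    denominator = eps ** 2 / 2.0 - eps ** 3 / 3.0
    # Round up so that the returned k still satisfies the inequality.
    return math.ceil(4.0 * math.log(n_samples) / denominator)


# The bound grows only logarithmically in n but quadratically in 1 / eps.
for eps in (0.5, 0.2, 0.1):
    print(eps, jl_min_dim(10_000, eps))
```

For 10,000 samples and $ϵ = 0.1$ the bound is several thousand components, so random projection only reduces dimensionality when the input has more features than that, or when a looser tolerance is acceptable.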

Gaussian Projection:

$X_{\mathrm{proj}} = X \cdot R, \quad R_{i j} \sim N (0, 1 / k)$

where $X \in ℝ^{n \times d}$ is the input and $R \in ℝ^{d \times k}$ is the random projection matrix.
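
The Gaussian projection can be sketched in a few lines of NumPy. This is an illustrative CPU sketch under stated assumptions, not the cuML implementation: the helper names `gaussian_random_projection` and `squared_distance_ratios`, the seeds, and the array sizes are all hypothetical. The second helper measures the ratio that the JL inequality bounds between $1 - ϵ$ and $1 + ϵ$ .

```python
import numpy as np


def gaussian_random_projection(X: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Project the rows of X (n x d) to k dimensions with R_ij ~ N(0, 1/k)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # A standard deviation of 1 / sqrt(k) gives each entry variance 1 / k.
    R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(d, k))
    return X @ R


def squared_distance_ratios(X: np.ndarray, X_proj: np.ndarray) -> np.ndarray:
    """Projected / original squared distance for every pair of distinct rows."""
    i, j = np.triu_indices(X.shape[0], k=1)
    original = np.sum((X[i] - X[j]) ** 2, axis=1)
    projected = np.sum((X_proj[i] - X_proj[j]) ** 2, axis=1)
    return projected / original


# Synthetic data for illustration: 50 points in 2000 dimensions.
rng = np.random.default_rng(42)
X = rng.standard_normal((50, 2000))
X_proj = gaussian_random_projection(X, k=1000, seed=1)
ratios = squared_distance_ratios(X, X_proj)
print(ratios.min(), ratios.max())  # both should lie close to 1
```

The matrix `R` is drawn without looking at the values in `X`, which is what makes the method data-independent; only `X.shape[1]` and `k` are used.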

Sparse Projection (Achlioptas):

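$R_{i j} = \sqrt{\frac{s}{k}} \cdot \begin{cases} + 1 & \text{with probability } 1 / (2 s) \\ 0 & \text{with probability } 1 - 1 / s \\ - 1 & \text{with probability } 1 / (2 s) \end{cases}$

where $s \geq 1$ controls the sparsity and the factor $\sqrt{s / k}$ gives every entry variance $1 / k$ , as in the Gaussian case.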
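
The sparse construction can be sketched as follows. This is a hedged illustration rather than the cuML implementation: the function name `sparse_random_matrix` is hypothetical, the entries are scaled by $\sqrt{s / k}$ so that their variance is $1 / k$ , and the matrix is stored as a dense array for simplicity, whereas a practical implementation would keep it in a sparse format to realize the speedup.

```python
import numpy as np


def sparse_random_matrix(d: int, k: int, s: float, seed: int = 0) -> np.ndarray:
    """Draw a d x k matrix with entries in {+sqrt(s/k), 0, -sqrt(s/k)}."""
    if s < 1.0:
        raise ValueError("s must be at least 1")
    rng = np.random.default_rng(seed)
    # Signs +1, 0, -1 with probabilities 1/(2s), 1 - 1/s, 1/(2s).
    signs = rng.choice(
        [1.0, 0.0, -1.0],
        size=(d, k),
        p=[1.0 / (2.0 * s), 1.0 - 1.0 / s, 1.0 / (2.0 * s)],
    )
    # Scaling by sqrt(s / k) makes the variance of every entry equal to 1 / k.
    return np.sqrt(s / k) * signs


d, k = 500, 200
s = 3.0  # Achlioptas's choice; Li et al. suggest s = sqrt(d) instead
R = sparse_random_matrix(d, k, s)
density = np.count_nonzero(R) / R.size
print(density)  # close to 1 / s
```

Projection then proceeds exactly as in the Gaussian case, `X @ R`, with only about a fraction $1 / s$ of the entries of `R` contributing to each product.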

Related Pages

Implemented By

Implementation:Rapidsai_Cuml_RandomProjection
