Principle:Rapidsai Cuml Multi GPU Linear Algebra

Knowledge Sources	Golub & Van Loan 2013 - Matrix Computations (4th ed.) NCCL Documentation cuSOLVER Documentation
Domains	Machine_Learning, Linear_Algebra, Distributed_Computing
Last Updated	2026-02-08 12:00 GMT

Overview

Multi-GPU linear algebra distributes large-scale matrix operations (eigendecomposition, regression solvers, data preprocessing) across multiple GPUs using collective communication primitives, enabling machine learning on datasets that exceed single-GPU memory.

Description

Many machine learning algorithms reduce to core linear algebra operations: matrix multiplication, eigendecomposition, singular value decomposition, and solving linear systems. When datasets are too large for a single GPU, these operations must be distributed across multiple GPUs (and potentially multiple nodes). This principle describes how fundamental linear algebra building blocks are parallelized in the multi-GPU (MG) setting.

Multi-GPU Eigendecomposition: PCA and related methods require computing eigenvalues and eigenvectors of covariance matrices. In the multi-GPU setting, the data matrix is partitioned across GPUs using a part descriptor that tracks which rows reside on which rank. The covariance matrix is computed in a distributed fashion using collective statistics operations (distributed mean, distributed covariance), and the eigendecomposition is performed to extract principal components, explained variances, and singular values. The truncated eigendecomposition retains only the top-k components needed.

Multi-GPU Coordinate Descent: Coordinate descent is an iterative optimization algorithm for solving regularized linear models (Lasso, Elastic Net). In the MG setting, each GPU holds a partition of the training data. At each iteration, coordinates (features) are updated one at a time, with soft-thresholding applied for L1 regularization. The gradient computation requires an inner product across all data partitions, which is accomplished via distributed matrix-vector products. The coordinate update order can be shuffled for better convergence.

Multi-GPU GLM Preprocessing: Before fitting generalized linear models, the data must be preprocessed by centering features and labels (subtracting means) when fitting an intercept. In the distributed setting, computing the mean requires a collective reduction across all GPU partitions. After model fitting, the intercept is recovered from the relationship between the coefficient vector and the feature/label means. This preprocessing is shared by OLS, Ridge, Lasso, and Elastic Net.

Multi-GPU Sign Flip: After computing eigenvectors (principal components) in a distributed PCA, the sign of each component is ambiguous (both $v$ and $- v$ are valid eigenvectors). Sign flip stabilization ensures deterministic results by enforcing the convention that the element with the largest absolute value in each component is positive. In the MG setting, finding the maximum absolute value requires scanning across all data partitions on different GPUs and applying the sign flip consistently.

Usage

Multi-GPU linear algebra is the right choice when:

The dataset exceeds single-GPU memory and must be partitioned across devices.
Training time on a single GPU is prohibitive and can be reduced by data parallelism.
The ML algorithm is built on linear algebra primitives (PCA, linear regression, ridge regression, lasso, elastic net).
A multi-GPU or multi-node cluster is available with fast interconnects (NVLink, InfiniBand).
Deterministic, reproducible results are needed across distributed runs (sign flip stabilization).

Theoretical Basis

Distributed Covariance Computation:

$C = \frac{1}{n - 1} \sum_{g = 1}^{G} (X_{g} - \bar{X})^{T} (X_{g} - \bar{X})$

where $X_{g}$ is the data partition on GPU $g$ , $\bar{X}$ is the global mean computed via allreduce, and the partial products are summed across GPUs.

Coordinate Descent Update (Elastic Net):

For each feature j (possibly shuffled):
    residual = y - X * beta + X_j * beta_j
    rho_j = (1/n) * X_j^T * residual    # distributed inner product
    beta_j = soft_threshold(rho_j, alpha * l1_ratio) / (1 + alpha * (1 - l1_ratio))

Soft Thresholding:

$S (ρ, λ) = sign (ρ) \cdot \max (| ρ | - λ, 0)$

Sign Flip Convention:

For each component v_k:
    j* = argmax_j |v_k[j]|    # across all GPU partitions
    If v_k[j*] < 0:
        v_k = -v_k

Intercept Recovery:

$intercept = \bar{y} - {\bar{X}}^{T} β$

where $\bar{y}$ and $\bar{X}$ are the global means computed during preprocessing.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment