Implementation:Rapidsai Cuml Sign Flip MG

Knowledge Sources	Rapidsai_Cuml
Domains	Machine_Learning, Dimensionality_Reduction
Last Updated	2026-02-08 12:00 GMT

Overview

Provides multi-node multi-GPU (MNMG) sign-flip operations for stabilizing the sign of eigenvectors produced by PCA and tSVD decompositions.

Description

Eigenvectors are determined only up to sign: if v is an eigenvector, so is -v (likewise, a pair of singular vectors (u, v) can be replaced by (-u, -v)). This ambiguity can make results differ between runs or between nodes in a distributed setting. The functions in this header resolve it by choosing, for each component, the sign under which the entry with the largest absolute value in the corresponding column of the sign-determining data is positive, and negating the component when it is not. This yields deterministic, reproducible results.
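The rule can be illustrated on the host with a single in-memory matrix. The sketch below is a minimal reference under stated assumptions, not cuML code: `sign_flip_reference` is a hypothetical name, both matrices are assumed to be column-major, the sign-determining data is assumed to have one column per component, and the components matrix is assumed to hold n_components rows of n_features values. For each component it locates the largest-magnitude entry of the matching data column and, if that entry is negative, negates both the column and the component.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference of the sign-flip rule (not a cuML API).
// data:       n_rows x n_components, column-major (one column per component)
// components: n_components x n_features, column-major, so component j is
//             stored at j, j + n_components, j + 2 * n_components, ...
template <typename T>
void sign_flip_reference(std::vector<T>& data, std::size_t n_rows,
                         std::vector<T>& components, std::size_t n_components,
                         std::size_t n_features) {
  for (std::size_t j = 0; j < n_components; ++j) {
    const T* col = data.data() + j * n_rows;
    // Find the entry with the largest magnitude in column j (first one wins ties).
    std::size_t arg = 0;
    T best = T(0);
    for (std::size_t i = 0; i < n_rows; ++i) {
      if (std::abs(col[i]) > best) {
        best = std::abs(col[i]);
        arg = i;
      }
    }
    // If that entry is negative, negate data column j and component j together.
    if (n_rows > 0 && col[arg] < T(0)) {
      for (std::size_t i = 0; i < n_rows; ++i) data[j * n_rows + i] = -data[j * n_rows + i];
      for (std::size_t f = 0; f < n_features; ++f)
        components[j + f * n_components] = -components[j + f * n_components];
    }
  }
}
```

Applying the rule a second time changes nothing, because after the first pass every deciding entry is already positive.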

Two function families are provided:

sign_flip_components_u: Flips the sign of each component based on U (the left singular vectors), using the input data matrix to determine the dominant sign. In addition to the number of components, it takes the number of samples and features, plus a center flag that controls whether the input data is centered by columns before the decision is made.

sign_flip: A simpler variant that determines the signs from the input data directly and flips the components matrix accordingly. Apart from the handle, data, descriptor, and streams, it needs only the number of components.
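A host-side sketch of a U-based decision may help clarify what sign_flip_components_u computes. It assumes U is formed by projecting the (optionally column-centered) input onto the components, U = (X - mean) · C^T; the header does not say whether cuML also scales U by the singular values, but a positive per-column scale would not change which entry has the largest magnitude or its sign. The function name `sign_flip_components_u_reference` and the column-major layouts are assumptions for illustration only.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side sketch (not cuML code) of a U-based sign decision.
// x:          n_samples x n_features, column-major input data (left unmodified)
// components: n_components x n_features, column-major (component j at j + f * n_components)
template <typename T>
void sign_flip_components_u_reference(const std::vector<T>& x, std::size_t n_samples,
                                      std::size_t n_features, std::vector<T>& components,
                                      std::size_t n_components, bool center) {
  // Column means, used only when center is true (otherwise they stay zero).
  std::vector<T> mean(n_features, T(0));
  if (center && n_samples > 0) {
    for (std::size_t f = 0; f < n_features; ++f) {
      for (std::size_t i = 0; i < n_samples; ++i) mean[f] += x[f * n_samples + i];
      mean[f] /= static_cast<T>(n_samples);
    }
  }
  for (std::size_t j = 0; j < n_components; ++j) {
    // Column j of U: u_i = sum_f (x_if - mean_f) * c_jf; track its largest-magnitude entry.
    T best = T(0);
    T best_val = T(0);
    for (std::size_t i = 0; i < n_samples; ++i) {
      T u = T(0);
      for (std::size_t f = 0; f < n_features; ++f)
        u += (x[f * n_samples + i] - mean[f]) * components[j + f * n_components];
      if (std::abs(u) > best) {
        best = std::abs(u);
        best_val = u;
      }
    }
    // Flip component j if the largest-magnitude entry of its U column is negative.
    if (best_val < T(0)) {
      for (std::size_t f = 0; f < n_features; ++f)
        components[j + f * n_components] = -components[j + f * n_components];
    }
  }
}
```

Centering can change the outcome: with a single feature column (4, 5, 0) and component (1, 0), the uncentered U is (4, 5, 0) and the component is kept, while the centered U is (1, 2, -3) and the component is flipped.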

Both families have float and double overloads. They operate on distributed data described by MLCommon::Matrix::PartDescriptor and accept an array of CUDA streams so that the partitions held by each rank can be processed concurrently.
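Making the decision consistent across ranks requires every rank to agree on the largest-magnitude entry of each column of the full distributed matrix, not just of its own rows. The following host-side simulation sketches one way to do this, under assumptions: partitions are plain std::vector row blocks rather than device buffers, the `Partition`, `local_max_abs`, and `global_signs` names are hypothetical, and a simple loop stands in for the cross-rank exchange (for example, an allgather of per-partition candidates). Each partition reports its signed largest-magnitude entry per column, and the global sign comes from the candidate with the largest magnitude.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One row-partition of a column-major matrix with n_cols columns (hypothetical type).
template <typename T>
struct Partition {
  std::vector<T> values;  // n_rows x n_cols, column-major
  std::size_t n_rows;
};

// Signed entry of largest magnitude in each column of one partition.
template <typename T>
std::vector<T> local_max_abs(const Partition<T>& p, std::size_t n_cols) {
  std::vector<T> out(n_cols, T(0));
  for (std::size_t j = 0; j < n_cols; ++j)
    for (std::size_t i = 0; i < p.n_rows; ++i) {
      T v = p.values[j * p.n_rows + i];
      if (std::abs(v) > std::abs(out[j])) out[j] = v;
    }
  return out;
}

// Combine per-partition candidates (as if gathered from every rank) into one
// sign per column: +1 keeps the column's component, -1 flips it.
template <typename T>
std::vector<int> global_signs(const std::vector<Partition<T>>& parts, std::size_t n_cols) {
  std::vector<T> best(n_cols, T(0));
  for (const auto& p : parts) {
    std::vector<T> local = local_max_abs(p, n_cols);
    for (std::size_t j = 0; j < n_cols; ++j)
      if (std::abs(local[j]) > std::abs(best[j])) best[j] = local[j];
  }
  std::vector<int> signs(n_cols);
  for (std::size_t j = 0; j < n_cols; ++j) signs[j] = best[j] < T(0) ? -1 : 1;
  return signs;
}
```

When no two candidates tie in magnitude, the winner is the unique largest-magnitude entry of the whole column, so splitting the same matrix into different row partitions yields the same signs.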

Usage

Use these functions after performing a distributed PCA or tSVD decomposition in a multi-GPU environment to ensure that the resulting eigenvectors have consistent signs across all nodes. This is essential for reproducibility in multi-node training pipelines.

Code Reference

Source Location

Repository: Rapidsai_Cuml
File: cpp/include/cuml/decomposition/sign_flip_mg.hpp

Signature

namespace ML {
namespace PCA {
namespace opg {

void sign_flip_components_u(raft::handle_t& handle,
                            std::vector<MLCommon::Matrix::Data<float>*>& input_data,
                            MLCommon::Matrix::PartDescriptor& input_desc,
                            float* components,
                            std::size_t n_samples,
                            std::size_t n_features,
                            std::size_t n_components,
                            cudaStream_t* streams,
                            std::uint32_t n_stream,
                            bool center);

void sign_flip_components_u(raft::handle_t& handle,
                            std::vector<MLCommon::Matrix::Data<double>*>& input_data,
                            MLCommon::Matrix::PartDescriptor& input_desc,
                            double* components,
                            std::size_t n_samples,
                            std::size_t n_features,
                            std::size_t n_components,
                            cudaStream_t* streams,
                            std::uint32_t n_stream,
                            bool center);

void sign_flip(raft::handle_t& handle,
               std::vector<MLCommon::Matrix::Data<float>*>& input_data,
               MLCommon::Matrix::PartDescriptor& input_desc,
               float* components,
               std::size_t n_components,
               cudaStream_t* streams,
               std::uint32_t n_stream);

void sign_flip(raft::handle_t& handle,
               std::vector<MLCommon::Matrix::Data<double>*>& input_data,
               MLCommon::Matrix::PartDescriptor& input_desc,
               double* components,
               std::size_t n_components,
               cudaStream_t* streams,
               std::uint32_t n_stream);

};  // end namespace opg
};  // end namespace PCA
};  // end namespace ML

Import

#include <cuml/decomposition/sign_flip_mg.hpp>

I/O Contract

Inputs (sign_flip_components_u)

Name	Type	Required	Description
handle	raft::handle_t&	Yes	cuML handle for GPU resource management
input_data	std::vector<MLCommon::Matrix::Data<T>*>&	Yes	Distributed input matrix partitions on device, used to determine the signs
input_desc	MLCommon::Matrix::PartDescriptor&	Yes	MNMG descriptor for the input partitions
n_samples	std::size_t	Yes	Total number of rows in the input matrix
n_features	std::size_t	Yes	Number of columns in the input/components matrix
n_components	std::size_t	Yes	Number of rows in the components matrix
streams	cudaStream_t*	Yes	Array of CUDA streams for concurrent execution
n_stream	std::uint32_t	Yes	Number of CUDA streams
center	bool	Yes	Whether to center the input data by columns

Inputs (sign_flip)

Name	Type	Required	Description
handle	raft::handle_t&	Yes	cuML handle for GPU resource management
input_data	std::vector<MLCommon::Matrix::Data<T>*>&	Yes	Distributed input matrix partitions on device, used to determine the signs
input_desc	MLCommon::Matrix::PartDescriptor&	Yes	MNMG descriptor for the input partitions
n_components	std::size_t	Yes	Number of columns in the components matrix
streams	cudaStream_t*	Yes	Array of CUDA streams for concurrent execution
n_stream	std::uint32_t	Yes	Number of CUDA streams

Outputs

Name	Type	Description
components	float* / double*	Device pointer to the components matrix; read as input and overwritten in place with the sign-corrected components

Usage Examples

#include <cuml/decomposition/sign_flip_mg.hpp>
#include <cuml/prims/opg/matrix/data.hpp>
#include <cuml/prims/opg/matrix/part_descriptor.hpp>
#include <raft/core/handle.hpp>

#include <cuda_runtime_api.h>

#include <cstddef>
#include <cstdint>
#include <vector>

// Stabilizes the signs of PCA/tSVD components computed on distributed data.
// input_data holds this rank's device partitions; components is a device pointer.
void stabilize_pca_signs(raft::handle_t& handle,
                         std::vector<MLCommon::Matrix::Data<float>*>& input_data,
                         MLCommon::Matrix::PartDescriptor& input_desc,
                         float* components,
                         std::size_t n_components) {
    // Create CUDA streams so local partitions can be processed concurrently
    const std::uint32_t n_streams = 4;
    std::vector<cudaStream_t> streams(n_streams);
    for (std::uint32_t i = 0; i < n_streams; i++) {
        cudaStreamCreate(&streams[i]);
    }

    // Flip component signs consistently across all ranks (in place on components)
    ML::PCA::opg::sign_flip(handle,
                             input_data,
                             input_desc,
                             components,
                             n_components,
                             streams.data(),
                             n_streams);

    // Wait for all queued work to finish before destroying the streams
    for (std::uint32_t i = 0; i < n_streams; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}

Related Pages

Environment:Rapidsai_Cuml_CUDA_GPU
