Implementation:Rapidsai Cuml Kmeans Sampling

Knowledge Sources	Rapidsai_Cuml
Domains	Machine_Learning, Explainability, Clustering, Data_Summarization
Last Updated	2026-02-08 12:00 GMT

Overview

Summarizes a dataset using weighted K-Means clustering, producing a compact set of representative samples for use with SHAP explainability methods.

Description

The kmeans_sampling function reduces a dataset X of shape (n_samples, n_features) to k representative cluster centers using GPU-accelerated K-Means. It is adapted from the SHAP library's dataset summarization utility.

Key steps:

Coerces the input (cuDF, Pandas, NumPy, CuPy, Numba) into a CuPy array.
Imputes missing values using SimpleImputer with mean strategy.
Fits a KMeans model with k clusters.
Optionally rounds each feature dimension of each cluster center to the nearest observed value in the original data, ensuring that discrete features retain valid values.
Returns the cluster centers as a summary, with optional group names and per-sample cluster labels.
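
The imputation and rounding steps above can be illustrated on the CPU with plain NumPy. This is a minimal sketch of the idea, not the cuML implementation: the helper names impute_mean and round_to_observed are hypothetical, and the cluster centers are hard-coded stand-ins for what a fitted K-Means model would return.

```python
import numpy as np


def impute_mean(X):
    """Replace NaNs with the mean of their column (hypothetical helper)."""
    X = np.asarray(X, dtype=float).copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X


def round_to_observed(centers, X):
    """Snap each feature of each center to the nearest value seen in X.

    Hypothetical helper: assumes X has no missing values.
    """
    centers = np.asarray(centers, dtype=float).copy()
    X = np.asarray(X, dtype=float)
    for j in range(X.shape[1]):
        col = X[:, j]
        # Distance from every center to every observed value of feature j.
        dist = np.abs(col[None, :] - centers[:, j][:, None])
        centers[:, j] = col[dist.argmin(axis=1)]
    return centers


# Feature 0 is binary; feature 1 only takes the values 10, 20 and 30.
X = impute_mean([[0.0, 10.0], [1.0, 20.0], [1.0, np.nan], [1.0, 30.0]])
centers = np.array([[0.4, 14.0], [0.9, 27.0]])  # stand-in cluster centers
print(round_to_observed(centers, X))
```

Rounding matters most for discrete or categorical-coded features: a raw center such as 0.4 for a binary column is not a value any real sample could take, whereas the snapped value is.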

This function is part of the cuml.explainer module and is typically used to build the background dataset summary passed to SHAP-based explainers (KernelExplainer, PermutationExplainer).

Usage

Use this function when you need a compact, representative summary of a large dataset for model explanation. It is especially useful as a background dataset for SHAP explainers, where using the full dataset would be computationally prohibitive.

Code Reference

Source Location

Repository: Rapidsai_Cuml
File: python/cuml/cuml/explainer/sampling.py

Signature

def kmeans_sampling(X, k, round_values=True, detailed=False, random_state=0)

Import

from cuml.explainer.sampling import kmeans_sampling

I/O Contract

Inputs

Name	Type	Required	Description
X	cuDF/Pandas DataFrame/Series, NumPy ndarray, CuPy array, or any device array exposing `__cuda_array_interface__`	Yes	Dataset to summarize, shape `(n_samples, n_features)`.
k	int	Yes	Number of cluster centers (means) to produce as the summary.
round_values	bool	No	If `True` (default), round each feature of each cluster center to the nearest observed value in `X`, preserving validity of discrete features.
detailed	bool	No	If `True`, return a tuple of `(summary, group_names, labels)`. Default `False`.
random_state	int	No	Random seed for K-Means. Default `0`.

Outputs

Name	Type	Description
summary	array of shape (k, n_features)	The K-Means cluster centers representing the summarized dataset.
group_names	list of str (only when `detailed=True`)	Column / feature names.
labels	array of shape (n_samples, 1) (only when `detailed=True`)	Cluster labels for every sample in the original dataset.
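
One way to use the labels output is to count how many original samples each cluster center stands for. The sketch below assumes labels has already been moved to host memory as a NumPy array (for a CuPy array, cupy.asnumpy(labels) would do that); cluster_weights is a hypothetical helper, not part of cuML, and the label values are made up for illustration.

```python
import numpy as np


def cluster_weights(labels, k):
    """Fraction of samples assigned to each of the k clusters (hypothetical helper)."""
    # labels may have shape (n_samples, 1), so flatten before counting.
    flat = np.asarray(labels).ravel().astype(int)
    counts = np.bincount(flat, minlength=k)
    return counts / counts.sum()


# Stand-in for the labels returned with detailed=True: 5 samples, 3 clusters.
labels = np.array([[0], [2], [2], [1], [2]])
weights = cluster_weights(labels, k=3)
print(weights)
```

Passing minlength=k keeps the result aligned with the rows of summary even if some cluster ends up with no samples.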

Usage Examples

import cupy as cp
from cuml.explainer.sampling import kmeans_sampling

# Create a synthetic dataset
X = cp.random.rand(1000, 5)

# Summarize to 10 representative samples
summary = kmeans_sampling(X, k=10)
print(summary.shape)  # (10, 5)

# Get detailed output with cluster labels
summary, group_names, labels = kmeans_sampling(X, k=10, detailed=True)
print("Feature names:", group_names)
print("Labels shape:", labels.shape)
