Implementation:Rapidsai Cuml Kmeans Sampling
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Explainability, Clustering, Data_Summarization |
| Last Updated | 2026-02-08 12:00 GMT |
Overview
Summarizes a dataset using weighted K-Means clustering, producing a compact set of representative samples for use with SHAP explainability methods.
Description
The kmeans_sampling function reduces a dataset X of shape (n_samples, n_features) to k representative rows, the cluster centers found by GPU-accelerated K-Means. It is adapted from the dataset summarization utility in the SHAP library.
Key steps:
- Coerces the input (cuDF, Pandas, NumPy, CuPy, Numba) into a CuPy array.
- Imputes missing values using SimpleImputer with the mean strategy.
- Fits a KMeans model with k clusters.
- Optionally rounds each feature dimension of each cluster center to the nearest observed value in the original data, ensuring that discrete features retain valid values.
- Returns the cluster centers as a summary, with optional group names and per-sample cluster labels.
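The imputation and rounding steps above can be sketched in plain Python. This is an illustrative sketch only, not the cuML implementation (which works on CuPy arrays on the GPU); the helper names impute_mean and round_to_observed, the toy data, and the hand-picked centers are all hypothetical.

```python
import math


def impute_mean(X):
    """Replace NaN entries with the mean of the non-missing values in the same column."""
    n_features = len(X[0])
    means = []
    for j in range(n_features):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)] for row in X]


def round_to_observed(centers, X):
    """Snap each coordinate of each center to the closest value present in that column of X."""
    n_features = len(X[0])
    columns = [[row[j] for row in X] for j in range(n_features)]
    return [
        [min(columns[j], key=lambda v: abs(v - center[j])) for j in range(n_features)]
        for center in centers
    ]


# Toy data: column 0 is a binary feature, column 1 has one missing value.
X = [[0.0, 1.0], [1.0, float("nan")], [1.0, 4.0], [0.0, 4.0]]
X_imputed = impute_mean(X)

# Hypothetical cluster centers, standing in for the output of a K-Means fit.
centers = [[0.4, 1.6], [0.9, 3.8]]
snapped = round_to_observed(centers, X_imputed)
print(snapped)
```

After rounding, the binary column contains only 0.0 or 1.0 again, which is the point of the round_values option for discrete features.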
This function is part of the cuml.explainer module and is typically called internally by SHAP-based explainers (KernelExplainer, PermutationExplainer) to create a background dataset summary.
Usage
Use this function when you need a compact, representative summary of a large dataset for model explanation. It is especially useful as a background dataset for SHAP explainers, where using the full dataset would be computationally prohibitive.
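A rough cost model shows why the summary matters. Kernel-style SHAP estimators evaluate the model once for every pair of sampled coalition and background row, so the work per explained instance grows linearly with the background size. The function name and the numbers below are hypothetical and only illustrate the scaling.

```python
def model_evaluations(n_background, n_coalitions):
    """Model calls needed to explain one instance under the simple cost model above."""
    return n_background * n_coalitions


n_coalitions = 2048  # hypothetical number of sampled feature coalitions

full = model_evaluations(n_background=100_000, n_coalitions=n_coalitions)
summarized = model_evaluations(n_background=10, n_coalitions=n_coalitions)

print(full, summarized, full // summarized)
```

Under this model, replacing a 100,000-row background with a 10-row summary cuts the number of model calls by the same factor as the row count.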
Code Reference
Source Location
- Repository: Rapidsai_Cuml
- File: python/cuml/cuml/explainer/sampling.py
Signature
def kmeans_sampling(X, k, round_values=True, detailed=False, random_state=0)
Import
from cuml.explainer.sampling import kmeans_sampling
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | cuDF/Pandas DataFrame/Series, NumPy ndarray, CuPy array, or cuda_array_interface device array | Yes | Dataset to summarize, shape (n_samples, n_features). |
| k | int | Yes | Number of cluster centers (means) to produce as the summary. |
| round_values | bool | No | If True (default), round each feature of each cluster center to the nearest observed value in X, preserving validity of discrete features. |
| detailed | bool | No | If True, return a tuple of (summary, group_names, labels). Default False. |
| random_state | int | No | Random seed for K-Means. Default 0. |
Outputs
| Name | Type | Description |
|---|---|---|
| summary | array of shape (k, n_features) | The K-Means cluster centers representing the summarized dataset. |
| group_names | list of str (only when detailed=True) | Column / feature names. |
| labels | array of shape (n_samples, 1) (only when detailed=True) | Cluster labels for every sample in the original dataset. |
Usage Examples
import cupy as cp
from cuml.explainer.sampling import kmeans_sampling
# Create a synthetic dataset
X = cp.random.rand(1000, 5)
# Summarize to 10 representative samples
summary = kmeans_sampling(X, k=10)
print(summary.shape) # (10, 5)
# Get detailed output with cluster labels
summary, group_names, labels = kmeans_sampling(X, k=10, detailed=True)
print("Feature names:", group_names)
print("Labels shape:", labels.shape)
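One possible use of the per-sample labels is to weight each summary row by the share of samples assigned to its cluster. The sketch below assumes the labels have already been copied to the host as a nested list of shape (n_samples, 1), matching the Outputs table; cluster_weights is a hypothetical helper and the label values are toy data, not output from kmeans_sampling.

```python
from collections import Counter


def cluster_weights(labels, k):
    """Fraction of samples assigned to each of the k clusters."""
    flat = [int(row[0]) for row in labels]  # flatten (n_samples, 1) to n_samples
    counts = Counter(flat)
    n = len(flat)
    return [counts.get(c, 0) / n for c in range(k)]


# Toy labels for 8 samples and k=3 clusters.
labels = [[0], [2], [2], [1], [2], [0], [2], [2]]
weights = cluster_weights(labels, k=3)
print(weights)
```

The weights sum to 1, so they can serve as relative importances for the summary rows.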