Implementation:Rapidsai Cuml Kmeans Sampling

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Explainability, Clustering, Data_Summarization
Last Updated: 2026-02-08 12:00 GMT

Overview

Summarizes a dataset using weighted K-Means clustering, producing a compact set of representative samples for use with SHAP explainability methods.

Description

The kmeans_sampling function reduces a dataset X of shape (n_samples, n_features) to k representative cluster centers using GPU-accelerated K-Means. This is adapted from the SHAP library's dataset summarization utility.

Key steps:

  1. Coerces the input (cuDF, Pandas, NumPy, CuPy, Numba) into a CuPy array.
  2. Imputes missing values using SimpleImputer with mean strategy.
  3. Fits a KMeans model with k clusters.
  4. Optionally rounds each feature dimension of each cluster center to the nearest observed value in the original data, ensuring that discrete features retain valid values.
  5. Returns the cluster centers as a summary, with optional group names and per-sample cluster labels.
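
Steps 2 and 4 above can be illustrated on the CPU. The following is a minimal NumPy sketch, not the cuML implementation: the helper names impute_mean and round_to_observed are hypothetical, and the cluster centers are written by hand rather than produced by K-Means.

```python
import numpy as np


def impute_mean(X):
    """Replace NaNs in each column with that column's mean (hypothetical helper)."""
    X = np.array(X, dtype=float)  # copy so the caller's data is untouched
    col_means = np.nanmean(X, axis=0)
    nan_rows, nan_cols = np.where(np.isnan(X))
    X[nan_rows, nan_cols] = col_means[nan_cols]
    return X


def round_to_observed(centers, X):
    """Snap each center coordinate to the nearest value seen in that column of X."""
    centers = np.array(centers, dtype=float)
    for j in range(X.shape[1]):
        # Distance from every observed value in column j to each center's j-th coordinate
        dist = np.abs(X[:, j][:, None] - centers[:, j][None, :])
        nearest = np.argmin(dist, axis=0)  # one row index per center
        centers[:, j] = X[nearest, j]
    return centers


# Column 0 is a binary feature; column 1 is continuous with one missing value.
X = np.array([[0.0, 1.0],
              [1.0, np.nan],
              [1.0, 3.0],
              [0.0, 2.0]])
X_filled = impute_mean(X)

# Hand-picked "cluster centers" standing in for K-Means output (an assumption).
centers = np.array([[0.4, 1.4],
                    [0.9, 2.6]])
rounded = round_to_observed(centers, X_filled)
print(rounded)
```

After rounding, the binary column holds only 0.0 or 1.0 again, which is the property step 4 is meant to preserve for discrete features.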

This function is part of the cuml.explainer module and is typically called internally by SHAP-based explainers (KernelExplainer, PermutationExplainer) to create a background dataset summary.

Usage

Use this function when you need a compact, representative summary of a large dataset for model explanation. It is especially useful as a background dataset for SHAP explainers, where using the full dataset would be computationally prohibitive.
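
As a rough sketch of that workflow, the summary can be passed as the background data of cuML's KernelExplainer. The wrapper function explain_with_summary and its parameter names are hypothetical; the code assumes a CUDA-capable GPU with cuML installed and a predict_fn that accepts the same array type as the background data.

```python
def explain_with_summary(predict_fn, X_background, X_explain, k=10):
    """Explain predictions using a k-row K-Means summary as the SHAP background.

    Hypothetical wrapper: the cuML imports live inside the function so that
    defining it does not require a GPU; calling it does.
    """
    from cuml.explainer import KernelExplainer
    from cuml.explainer.sampling import kmeans_sampling

    # Compress the background data to k representative rows.
    background = kmeans_sampling(X_background, k=k)

    # Fewer background rows means fewer model evaluations per explained sample.
    explainer = KernelExplainer(model=predict_fn, data=background)
    return explainer.shap_values(X_explain)
```

A call might look like explain_with_summary(model.predict, X_train, X_test[:5], k=10), where model is any fitted estimator. The value of k trades explanation cost against how faithfully the summary represents the data.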

Code Reference

Source Location

  • Repository: Rapidsai_Cuml
  • File: python/cuml/cuml/explainer/sampling.py

Signature

def kmeans_sampling(X, k, round_values=True, detailed=False, random_state=0)

Import

from cuml.explainer.sampling import kmeans_sampling

I/O Contract

Inputs

Name | Type | Required | Description
X | cuDF/Pandas DataFrame/Series, NumPy ndarray, CuPy array, or cuda_array_interface device array | Yes | Dataset to summarize, shape (n_samples, n_features).
k | int | Yes | Number of cluster centers (means) to produce as the summary.
round_values | bool | No | If True (default), round each feature of each cluster center to the nearest observed value in X, preserving validity of discrete features.
detailed | bool | No | If True, return a tuple of (summary, group_names, labels). Default False.
random_state | int | No | Random seed for K-Means. Default 0.

Outputs

Name | Type | Description
summary | array of shape (k, n_features) | The K-Means cluster centers representing the summarized dataset.
group_names | list of str | Column / feature names. Returned only when detailed=True.
labels | array of shape (n_samples, 1) | Cluster labels for every sample in the original dataset. Returned only when detailed=True.
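
When detailed=True, the labels show how many original samples each summary row stands for. The sketch below uses hand-written labels and NumPy on the CPU as an assumption for illustration; with CuPy arrays, cupy.bincount would be the analogous call. The weights computed here are not part of the function's documented return value.

```python
import numpy as np

k = 3
# Hypothetical labels, shaped (n_samples, 1) as described in the Outputs table.
labels = np.array([[0], [2], [2], [1], [0], [2]])

# Flatten to 1-D, then count how many samples fall in each cluster.
counts = np.bincount(labels.ravel(), minlength=k)

# Normalise the counts into weights that sum to 1.
weights = counts / counts.sum()
print(counts)
print(weights)
```

Such weights could be used to give larger clusters more influence when the summary rows are averaged downstream.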

Usage Examples

import cupy as cp
from cuml.explainer.sampling import kmeans_sampling

# Create a synthetic dataset
X = cp.random.rand(1000, 5)

# Summarize to 10 representative samples
summary = kmeans_sampling(X, k=10)
print(summary.shape)  # (10, 5)

# Get detailed output with cluster labels
summary, group_names, labels = kmeans_sampling(X, k=10, detailed=True)
print("Feature names:", group_names)
print("Labels shape:", labels.shape)
