Implementation:Scikit learn contrib Imbalanced learn ClusterCentroids
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Under_Sampling, Clustering |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Concrete tool, provided by the imbalanced-learn library, for under-sampling majority classes by replacing their samples with cluster centroids.
Description
The ClusterCentroids class implements a prototype generation under-sampling technique. It extends BaseUnderSampler and reduces the majority class by replacing clusters of majority samples with their centroids, computed via KMeans clustering by default. Given a target count of N majority samples, the algorithm fits KMeans with N clusters to the majority class and uses the coordinates of the N cluster centroids as the new representative majority samples.
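The procedure described above can be sketched by hand with scikit-learn alone. This is a minimal illustration, not the library's implementation: the toy dataset, the variable names (`n_target`, `X_manual`, `y_manual`) and the KMeans settings are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification

# Imbalanced toy data: class 0 is the minority, class 1 the majority.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=1,
    weights=[0.1, 0.9], flip_y=0, random_state=10,
)

minority_mask = y == 0
n_target = int(minority_mask.sum())  # N: majority samples to keep

# Fit KMeans with N clusters on the majority class only.
kmeans = KMeans(n_clusters=n_target, n_init=1, random_state=0)
kmeans.fit(X[~minority_mask])

# The N centroids stand in for the original majority samples.
X_manual = np.vstack([X[minority_mask], kmeans.cluster_centers_])
y_manual = np.concatenate([
    np.zeros(n_target, dtype=y.dtype),  # untouched minority labels
    np.ones(n_target, dtype=y.dtype),   # one label per centroid
])
```

After this step both classes contain `n_target` rows, and the majority rows are synthetic points rather than original observations.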
The voting parameter accepts three values that control how replacement samples are generated:
- "soft" uses the centroid coordinates directly as synthetic replacement samples. With "auto", this is what dense input resolves to.
- "hard" finds the nearest neighbor of each centroid among the original majority samples and uses those original samples instead. With "auto", this is what sparse input resolves to.
- "auto" (the default) selects "hard" when the input is sparse and "soft" otherwise.
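The difference between the two concrete strategies can be illustrated with plain scikit-learn. The following is a hedged sketch of the idea only: `X_majority` is hypothetical random data, and the use of `NearestNeighbors` with one neighbor is an assumption about how a nearest original sample could be looked up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_majority = rng.normal(size=(200, 2))  # hypothetical majority-class samples

kmeans = KMeans(n_clusters=10, n_init=1, random_state=0).fit(X_majority)

# "soft": the centroids themselves become the new (synthetic) samples.
X_soft = kmeans.cluster_centers_

# "hard": each centroid is swapped for its nearest original sample.
nn = NearestNeighbors(n_neighbors=1).fit(X_majority)
nearest_idx = nn.kneighbors(X_soft, return_distance=False)[:, 0]
X_hard = X_majority[nearest_idx]
```

Every row of `X_hard` is an existing observation, whereas rows of `X_soft` are averages that generally do not appear in the input. This is one reason hard voting suits sparse data: averaged centroids are usually dense.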
The algorithm processes each class independently, applying clustering only to classes specified in the sampling strategy while leaving other classes unchanged.
Usage
Import this class when you want to reduce the majority class by creating representative synthetic samples that summarize clusters of similar majority instances, rather than simply selecting or removing existing samples. This is particularly useful when you want to preserve the underlying distribution structure of the majority class while reducing its size.
Code Reference
Source Location
- Repository: Scikit_learn_contrib_Imbalanced_learn
- File: imblearn/under_sampling/_prototype_generation/_cluster_centroids.py
- Lines: L28-210
Signature
class ClusterCentroids(BaseUnderSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto",
    ):
        """
        Args:
            sampling_strategy: str, dict, or callable - Desired ratio of
                samples after resampling. 'auto' equalizes all classes to
                the minority class count.
            random_state: int, RandomState, or None - Seed for reproducibility.
                Controls the random state of the underlying KMeans estimator.
            estimator: estimator object or None - A scikit-learn compatible
                clustering method exposing a `n_clusters` parameter and a
                `cluster_centers_` fitted attribute. Defaults to KMeans.
            voting: {'hard', 'soft', 'auto'} - Strategy for generating new
                samples from centroids (default: 'auto').
        """
Import
from imblearn.under_sampling import ClusterCentroids
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels |
| sampling_strategy | str, dict, or callable | No | Resampling ratio (default: 'auto') |
| estimator | estimator object or None | No | Clustering estimator with `n_clusters` param (default: KMeans) |
| voting | {'hard', 'soft', 'auto'} | No | Voting strategy for centroid generation (default: 'auto') |
| random_state | int, RandomState, or None | No | Random seed for reproducibility |
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with majority class replaced by cluster centroids |
| y_resampled | ndarray of shape (n_samples_new,) | Target array with labels for resampled data |
Fitted Attributes
| Name | Type | Description |
|---|---|---|
| sampling_strategy_ | dict | Maps each class label to be resampled to its target number of samples |
| estimator_ | estimator object | The validated clustering estimator (KMeans by default) |
| voting_ | str | The resolved voting strategy ('hard' or 'soft') |
| n_features_in_ | int | Number of features in the input dataset |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (only when X has string feature names) |
Usage Examples
Basic Usage
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids
# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Apply ClusterCentroids under-sampling
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
With Custom Clustering Estimator
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from imblearn.under_sampling import ClusterCentroids
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
# Use MiniBatchKMeans for faster clustering on large datasets
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42,
)
X_res, y_res = cc.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
In a Pipeline
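from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import ClusterCentroids
# Build an imbalanced dataset and hold out a test split
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
# The sampler resamples the training data during fit only; predict is unaffected
pipeline = make_pipeline(
    ClusterCentroids(random_state=42),
    DecisionTreeClassifier(),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)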
Hard Voting for Sparse Data
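from scipy import sparse
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids
# Build an imbalanced dataset and convert it to a sparse CSR matrix
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=10)
X_sparse = sparse.csr_matrix(X)
# Use hard voting to select nearest original samples to centroids
# (automatically chosen for sparse input when voting='auto')
cc = ClusterCentroids(voting="hard", random_state=42)
X_res, y_res = cc.fit_resample(X_sparse, y)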