Implementation:Scikit learn contrib Imbalanced learn ClusterCentroids

Knowledge Sources	Scikit_learn_contrib_Imbalanced_learn imbalanced-learn Docs
Domains	Machine_Learning, Under_Sampling, Clustering
Last Updated	2026-02-09 03:00 GMT

Overview

Concrete tool, provided by the imbalanced-learn library, for under-sampling majority classes by replacing their samples with cluster centroids.

Description

The ClusterCentroids class implements a prototype generation under-sampling technique. It extends BaseUnderSampler and reduces a majority class by replacing clusters of its samples with the cluster centroids computed by KMeans. Given a target count of N samples for a class, the algorithm fits KMeans with N clusters to that class and uses the N cluster centroids (or, under "hard" voting, the original sample nearest to each centroid) as the new representative samples.
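The centroid-generation step can be sketched with scikit-learn's KMeans alone. This is a minimal illustration of the idea, not the library's implementation; the toy data, the variable names, and the choice of `n_init=10` are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Hypothetical toy data: 200 majority samples (label 0), 20 minority samples (label 1)
X_major = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minor = rng.normal(loc=4.0, scale=1.0, size=(20, 2))

n_target = len(X_minor)  # N: number of majority samples to keep

# Fit KMeans with N clusters on the majority class only
kmeans = KMeans(n_clusters=n_target, n_init=10, random_state=0).fit(X_major)

# The N centroids stand in for the 200 original majority samples
X_res = np.vstack([kmeans.cluster_centers_, X_minor])
y_res = np.hstack([np.zeros(n_target, dtype=int), np.ones(len(X_minor), dtype=int)])

print(X_res.shape, np.bincount(y_res))
```

After this step both classes contain 20 rows, and the majority rows are synthetic points rather than members of the original dataset.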

The voting parameter accepts three values that control how replacement samples are produced:

"soft" voting uses the centroid coordinates directly as synthetic replacement samples. This is what "auto" resolves to for dense input data.
"hard" voting finds the nearest neighbor of each centroid among the original samples of the class and uses those original samples instead. This is what "auto" resolves to for sparse input data.
"auto" (the default) selects "hard" when the input is sparse and "soft" otherwise.
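The difference between the two concrete strategies can be sketched as follows. This is an illustrative approximation built from scikit-learn's KMeans and NearestNeighbors, assuming randomly generated majority-class data; it is not a copy of the library's code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X_major = rng.normal(size=(200, 2))  # hypothetical majority-class samples

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_major)
centroids = kmeans.cluster_centers_

# "soft": keep the centroid coordinates themselves (synthetic points)
X_soft = centroids

# "hard": replace each centroid with its nearest original majority sample
nn = NearestNeighbors(n_neighbors=1).fit(X_major)
idx = nn.kneighbors(centroids, return_distance=False).ravel()
X_hard = X_major[idx]

print(X_soft.shape, X_hard.shape)
```

Every row of `X_hard` is an existing row of `X_major`, which is why hard voting preserves sparsity: no new dense coordinates are created.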

The algorithm processes each class independently, applying clustering only to classes specified in the sampling strategy while leaving other classes unchanged.

Usage

Import this class when you want to reduce the majority class by creating representative synthetic samples that summarize clusters of similar majority instances, rather than simply selecting or removing existing samples. This is particularly useful when you want to preserve the underlying distribution structure of the majority class while reducing its size.

Code Reference

Source Location

Repository: Scikit_learn_contrib_Imbalanced_learn
File: imblearn/under_sampling/_prototype_generation/_cluster_centroids.py
Lines: L28-210

Signature

class ClusterCentroids(BaseUnderSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto",
    ):
        """
        Args:
            sampling_strategy: float, str, dict, or callable - Sampling
                information used to resample the dataset. 'auto' resamples
                every class except the minority down to the minority count.
            random_state: int, RandomState, or None - Seed for reproducibility.
                Controls the random state of the underlying KMeans estimator.
            estimator: estimator object or None - A scikit-learn compatible
                clustering method exposing a `n_clusters` parameter and a
                `cluster_centers_` fitted attribute. Defaults to KMeans.
            voting: {'hard', 'soft', 'auto'} - Strategy for generating new
                samples from centroids (default: 'auto').
        """

Import

from imblearn.under_sampling import ClusterCentroids

I/O Contract

Inputs

Name	Type	Required	Description
X	{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)	Yes	Feature matrix of training data
y	array-like of shape (n_samples,)	Yes	Target labels
sampling_strategy	float, str, dict, or callable	No	Sampling information: target ratio, class selection, or per-class sample counts (default: 'auto')
estimator	estimator object or None	No	Clustering estimator with `n_clusters` param (default: KMeans)
voting	{'hard', 'soft', 'auto'}	No	Voting strategy for centroid generation (default: 'auto')
random_state	int, RandomState, or None	No	Random seed for reproducibility

Outputs

Name	Type	Description
X_resampled	{ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features)	Feature matrix with majority class replaced by cluster centroids
y_resampled	ndarray of shape (n_samples_new,)	Target array with labels for resampled data

Fitted Attributes

Name	Type	Description
sampling_strategy_	dict	Maps class labels to the number of samples to generate
estimator_	estimator object	The validated clustering estimator (KMeans by default)
voting_	str	The resolved voting strategy ('hard' or 'soft')
n_features_in_	int	Number of features in the input dataset
feature_names_in_	ndarray of shape (n_features_in_,)	Names of features seen during fit (only when X has string feature names)

Usage Examples

Basic Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")

# Apply ClusterCentroids under-sampling
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")

With Custom Clustering Estimator

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from imblearn.under_sampling import ClusterCentroids

X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Use MiniBatchKMeans for faster clustering on large datasets
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42,
)
X_res, y_res = cc.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")

In a Pipeline

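from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import ClusterCentroids

# Build an imbalanced dataset and hold out a stratified test split
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The sampler resamples during fit only; predict sees X_test unchanged
pipeline = make_pipeline(
    ClusterCentroids(random_state=42),
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)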

Hard Voting for Sparse Data

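from scipy import sparse
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Build an imbalanced dataset and convert it to a sparse CSR matrix
X, y = make_classification(n_samples=1000, weights=[0.1, 0.9], random_state=10)
X_sparse = sparse.csr_matrix(X)

# Use hard voting to select nearest original samples to centroids
# (automatically chosen for sparse input when voting='auto')
cc = ClusterCentroids(voting="hard", random_state=42)
X_res, y_res = cc.fit_resample(X_sparse, y)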
