Implementation:Scikit learn contrib Imbalanced learn ClusterCentroids

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Under_Sampling, Clustering
Last Updated: 2026-02-09 03:00 GMT

Overview

Concrete tool, provided by the imbalanced-learn library, for under-sampling majority classes by replacing their samples with cluster centroids.

Description

The ClusterCentroids class implements a prototype generation under-sampling technique. It extends BaseUnderSampler and reduces the majority class by replacing clusters of majority samples with their centroid computed via KMeans clustering. Given a target count of N majority samples, the algorithm fits KMeans with N clusters to the majority class and uses the coordinates of the N cluster centroids as the new representative majority samples.
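
The centroid step described above can be sketched directly with scikit-learn's KMeans. This is a minimal illustration under stated assumptions, not the library's own code: the toy data, the variable names, and the target count n_target are all hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 majority samples and 20 minority samples
X_majority = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minority = rng.normal(loc=4.0, scale=1.0, size=(20, 2))

# Target count N: shrink the majority class to the minority class size
n_target = len(X_minority)

# Fit KMeans with N clusters on the majority class only
kmeans = KMeans(n_clusters=n_target, n_init=10, random_state=0)
kmeans.fit(X_majority)

# The N centroids stand in for the 200 original majority samples
X_majority_reduced = kmeans.cluster_centers_

# Reassemble a balanced dataset: centroids (label 0) plus minority (label 1)
X_resampled = np.vstack([X_majority_reduced, X_minority])
y_resampled = np.array([0] * n_target + [1] * len(X_minority))
print(X_resampled.shape)
```

The minority samples pass through untouched; only the majority class is summarized by its centroids.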

The class supports three voting strategies for generating replacement samples:

  • "soft" voting uses the centroid coordinates directly as synthetic replacement samples.
  • "hard" voting finds the nearest neighbor of each centroid among the original majority samples and uses those original samples instead.
  • "auto" (the default) resolves to "hard" when the input is sparse and to "soft" otherwise.
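
The difference between the two concrete strategies can be sketched with scikit-learn primitives. This is an illustrative approximation, assuming a nearest-neighbor lookup with one neighbor per centroid; the data and the names X_soft, X_hard, and nearest_idx are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_majority = rng.normal(size=(200, 2))  # hypothetical majority-class samples

# Cluster the majority class into 20 groups
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_majority)
centroids = kmeans.cluster_centers_

# "soft": keep the centroid coordinates themselves (synthetic points)
X_soft = centroids

# "hard": replace each centroid with its nearest original majority sample
nn = NearestNeighbors(n_neighbors=1).fit(X_majority)
nearest_idx = nn.kneighbors(centroids, return_distance=False).ravel()
X_hard = X_majority[nearest_idx]

# Every hard-voted row is an existing sample from the original data
original_rows = {tuple(row) for row in X_majority}
print(all(tuple(row) in original_rows for row in X_hard))
```

Soft voting produces points that may not exist in the original data, whereas hard voting returns only rows already present in it.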

The algorithm processes each class independently, applying clustering only to classes specified in the sampling strategy while leaving other classes unchanged.

Usage

Import this class when you want to reduce the majority class by creating representative synthetic samples that summarize clusters of similar majority instances, rather than simply selecting or removing existing samples. This is particularly useful when you want to preserve the underlying distribution structure of the majority class while reducing its size.

Code Reference

Source Location

Signature

class ClusterCentroids(BaseUnderSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        estimator=None,
        voting="auto",
    ):
        """
        Args:
            sampling_strategy: float, str, dict, or callable - Sampling
                information used to resample the dataset. 'auto' (equivalent
                to 'not minority') reduces every class except the minority
                class to the minority class count.
            random_state: int, RandomState, or None - Seed for reproducibility.
                Seeds the default KMeans estimator used when `estimator` is None.
            estimator: estimator object or None - A scikit-learn compatible
                clustering method exposing a `n_clusters` parameter and a
                `cluster_centers_` fitted attribute. Defaults to KMeans.
            voting: {'hard', 'soft', 'auto'} - Strategy for generating new
                samples from centroids (default: 'auto').
        """

Import

from imblearn.under_sampling import ClusterCentroids

I/O Contract

Inputs

Name | Type | Required | Description
X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data
y | array-like of shape (n_samples,) | Yes | Target labels
sampling_strategy | float, str, dict, or callable | No | Which classes to resample and to what size (default: 'auto')
estimator | estimator object or None | No | Clustering estimator with an `n_clusters` parameter and a `cluster_centers_` fitted attribute (default: KMeans)
voting | {'hard', 'soft', 'auto'} | No | Voting strategy for generating samples from centroids (default: 'auto')
random_state | int, RandomState, or None | No | Random seed for reproducibility (default: None)

Outputs

Name | Type | Description
X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with majority class replaced by cluster centroids
y_resampled | ndarray of shape (n_samples_new,) | Target array with labels for resampled data

Fitted Attributes

Name | Type | Description
sampling_strategy_ | dict | Maps class labels to the number of samples to generate
estimator_ | estimator object | The validated clustering estimator (KMeans by default)
voting_ | str | The resolved voting strategy ('hard' or 'soft')
n_features_in_ | int | Number of features in the input dataset
feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (only when X has string feature names)

Usage Examples

Basic Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")

# Apply ClusterCentroids under-sampling
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")

With Custom Clustering Estimator

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.cluster import MiniBatchKMeans
from imblearn.under_sampling import ClusterCentroids

X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Use MiniBatchKMeans for faster clustering on large datasets
cc = ClusterCentroids(
    estimator=MiniBatchKMeans(n_init=1, random_state=0),
    random_state=42,
)
X_res, y_res = cc.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")

In a Pipeline

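from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import ClusterCentroids

# Build an imbalanced dataset and hold out a stratified test split
X, y = make_classification(weights=[0.1, 0.9], n_samples=1000, random_state=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# The sampler is applied only during fit; predict sees the test data unchanged
pipeline = make_pipeline(
    ClusterCentroids(random_state=42),
    DecisionTreeClassifier(random_state=0),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)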

Hard Voting for Sparse Data

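from scipy import sparse
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Illustrative sparse input: a CSR copy of a dense imbalanced dataset
X, y = make_classification(weights=[0.1, 0.9], n_samples=1000, random_state=10)
X_sparse = sparse.csr_matrix(X)

# Use hard voting to select nearest original samples to centroids
# (automatically chosen for sparse input when voting='auto')
cc = ClusterCentroids(voting="hard", random_state=42)
X_res, y_res = cc.fit_resample(X_sparse, y)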

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
