Implementation:Scikit learn contrib Imbalanced learn KMeansSMOTE

Knowledge Sources	imbalanced-learn, imbalanced-learn Docs
Domains	Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

KMeansSMOTE is a concrete tool for cluster-aware synthetic oversampling, provided by the imbalanced-learn library.

Description

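The KMeansSMOTE class clusters the input with KMeans and then applies SMOTE inside selected clusters. It extends BaseSMOTE and, when kmeans_estimator is None, uses a MiniBatchKMeans estimator for scalability. The cluster_balance_threshold parameter sets the minimum share of target-class samples a cluster must contain to receive synthetic samples. The density_exponent parameter controls how cluster density is computed when the synthetic samples are distributed across the selected clusters, with sparser minority clusters receiving more.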
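To make the roles of cluster_balance_threshold and the per-cluster allocation concrete, the following is a simplified NumPy sketch of the filter-and-weight step. It is an illustration under stated assumptions, not the library's code: the helper names select_clusters and allocate_samples are hypothetical, and the sparsity values are passed in as toy numbers rather than derived from pairwise distances and density_exponent.

```python
import numpy as np


def select_clusters(cluster_labels, y, minority_label, threshold):
    """Return ids of clusters whose minority share is at least `threshold`.

    Hypothetical helper: mirrors the idea behind cluster_balance_threshold.
    """
    selected = []
    for cluster_id in np.unique(cluster_labels):
        in_cluster = cluster_labels == cluster_id
        minority_share = np.mean(y[in_cluster] == minority_label)
        if minority_share >= threshold:
            selected.append(int(cluster_id))
    return selected


def allocate_samples(sparsities, n_to_generate):
    """Split `n_to_generate` across clusters in proportion to their sparsity.

    Hypothetical helper: counts are rounded up, so the total can slightly
    exceed `n_to_generate`.
    """
    weights = np.asarray(sparsities, dtype=float)
    weights = weights / weights.sum()
    return np.ceil(weights * n_to_generate).astype(int)


# Toy data: three clusters of four points each, label 1 is the minority.
cluster_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

kept = select_clusters(cluster_labels, y, minority_label=1, threshold=0.5)
counts = allocate_samples([1.0, 3.0], n_to_generate=8)  # toy sparsities
print(kept, counts)
```

Cluster 1 contains no minority samples, so it is filtered out. Of the two remaining clusters, the one with the larger sparsity value receives the larger share of the synthetic samples.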

Usage

Use this class when the data has natural cluster structure and you want to avoid generating synthetic samples in majority-dominated regions.

Code Reference

Source Location

Repository: imbalanced-learn
File: imblearn/over_sampling/_smote/cluster.py
Lines: L30-308

Signature

class KMeansSMOTE(BaseSMOTE):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        k_neighbors=2,
        n_jobs=None,
        kmeans_estimator=None,
        cluster_balance_threshold="auto",
        density_exponent="auto",
    ):
        """
        Args:
            sampling_strategy: str, dict, or callable - Resampling ratio.
            random_state: int, RandomState, or None - Seed.
            k_neighbors: int or NearestNeighbors - SMOTE neighbors (default: 2).
            n_jobs: int or None - Parallel jobs.
            kmeans_estimator: int or KMeans - Number of clusters or estimator
                (default: MiniBatchKMeans).
            cluster_balance_threshold: 'auto' or float - Min minority ratio
                per cluster.
            density_exponent: 'auto' or float - Exponent for density weighting.
        """

Import

from imblearn.over_sampling import KMeansSMOTE

I/O Contract

Inputs

Name	Type	Required	Description
X	{array-like, sparse matrix} of shape (n_samples, n_features)	Yes	Feature matrix
y	array-like of shape (n_samples,)	Yes	Target labels
kmeans_estimator	int or KMeans	No	Clustering estimator or number of clusters
cluster_balance_threshold	'auto' or float	No	Minimum minority ratio to oversample a cluster

Outputs

Name	Type	Description
X_resampled	{ndarray, sparse matrix} of shape (n_samples_new, n_features)	Feature matrix with cluster-guided synthetic samples
y_resampled	ndarray of shape (n_samples_new,)	Target array

Usage Examples

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

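# n_informative=3 is required: n_classes * n_clusters_per_class (2 * 3 = 6)
# must not exceed 2 ** n_informative, and the default n_informative is 2.
X, y = make_classification(
    n_classes=2, weights=[0.1, 0.9], n_samples=1000,
    n_clusters_per_class=3, n_informative=3, random_state=10
)
# The "auto" threshold can be too strict for data like this and raise a
# RuntimeError; 0.05 is an illustrative value, not a recommendation.
kmeans_smote = KMeansSMOTE(
    random_state=42, kmeans_estimator=10, cluster_balance_threshold=0.05
)
X_res, y_res = kmeans_smote.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")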

Related Pages

Implements Principle

Principle:Scikit_learn_contrib_Imbalanced_learn_Cluster_Based_Oversampling

Requires Environment

Environment:Scikit_learn_contrib_Imbalanced_learn_Python_Scikit_learn
