Principle: Scikit-learn-contrib Imbalanced-learn Cluster-Based Oversampling
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
An oversampling strategy that first clusters the input space, then applies SMOTE selectively within clusters that contain a sufficient share of minority samples.
Description
Cluster-based oversampling combines clustering with synthetic minority oversampling. The algorithm first partitions the data using KMeans, then evaluates each cluster's class balance. SMOTE is applied only within clusters that meet a balance threshold, ensuring synthetic samples are generated in meaningful regions of the feature space rather than in sparse or already-balanced areas.
This prevents the generation of noisy synthetic samples in regions dominated by majority-class instances and focuses oversampling on clusters where minority samples genuinely exist.
Usage
Use this principle when:
- The minority class is distributed across distinct sub-clusters
- Standard SMOTE generates noisy samples between distant minority regions
- The data has a natural cluster structure
- Reducing synthetic noise is more important than maximizing minority coverage
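The second point above can be shown with a toy calculation (the data layout here is purely illustrative): when SMOTE interpolates between minority points drawn from two distant sub-clusters, the synthetic sample lands in the gap between them, which may be majority territory.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two distant minority sub-clusters; assume the majority class occupies
# the region between them (illustrative setup, not real data).
blob_a = rng.normal(-8.0, 0.5, size=(20, 2))
blob_b = rng.normal(8.0, 0.5, size=(20, 2))

# If SMOTE pairs a point from blob_a with a "neighbour" from blob_b
# (possible when k_neighbors spans sub-clusters), the interpolation
# crosses the gap between the two regions.
p, q = blob_a[0], blob_b[0]
synthetic = p + 0.5 * (q - p)  # midpoint of the segment
print(synthetic)  # near the origin, far from either minority blob
```

Clustering first confines each interpolation to a single sub-cluster, so this failure mode cannot occur.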
Theoretical Basis
- Cluster all data using KMeans
- Filter clusters: keep only those where minority density exceeds a threshold
- Distribute synthetic sample generation across filtered clusters proportional to their minority density
- Apply SMOTE within each selected cluster
```python
# Abstract KMeans-SMOTE algorithm (NOT a real implementation)
clusters = KMeans(n_clusters=k).fit_predict(X)
for cluster_id in unique(clusters):
    minority_ratio = count_minority(cluster_id) / count_all(cluster_id)
    if minority_ratio >= cluster_balance_threshold:
        n_synthetic = proportional_allocation(cluster_id)
        apply_smote_in_cluster(cluster_id, n_synthetic)
```
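The abstract steps above can be fleshed out into a runnable sketch. All function and parameter names below are illustrative, not the imbalanced-learn API; the allocation step is done by sampling eligible clusters in proportion to their minority ratio.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_smote_sketch(X, y, minority_label, k=5, balance_threshold=0.5,
                        n_synthetic=100, n_neighbors=3, seed=0):
    """Toy cluster-based oversampling (hypothetical helper, for illustration)."""
    rng = np.random.default_rng(seed)
    clusters = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

    # Filter: keep clusters whose minority ratio clears the threshold and
    # that hold enough minority points to interpolate between.
    eligible, weights = [], []
    for c in np.unique(clusters):
        in_c = clusters == c
        ratio = np.mean(y[in_c] == minority_label)
        if ratio >= balance_threshold and \
                np.sum(in_c & (y == minority_label)) > n_neighbors:
            eligible.append(c)
            weights.append(ratio)
    if not eligible:
        raise ValueError("no cluster meets the balance threshold")
    weights = np.asarray(weights) / np.sum(weights)

    # Allocate generation across clusters in proportion to minority density,
    # then apply SMOTE-style interpolation within each selected cluster.
    synthetic = []
    for c in rng.choice(eligible, size=n_synthetic, p=weights):
        pts = X[(clusters == c) & (y == minority_label)]
        i = rng.integers(len(pts))
        d = np.linalg.norm(pts - pts[i], axis=1)
        nn = np.argsort(d)[1:n_neighbors + 1]  # nearest minority neighbours
        j = rng.choice(nn)
        gap = rng.random()
        synthetic.append(pts[i] + gap * (pts[j] - pts[i]))
    return np.vstack(synthetic)

# Usage on toy data: a diffuse majority blob and a tight, distant minority blob.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
               rng.normal(8.0, 0.5, (30, 2))])
y = np.array([0] * 200 + [1] * 30)
X_new = kmeans_smote_sketch(X, y, minority_label=1, k=4, n_synthetic=60)
```

Because interpolation stays inside a single cluster, every synthetic point lies within the convex hull of that cluster's minority samples, near the minority blob.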