Principle:Scikit learn contrib Imbalanced learn Cluster Centroid Under Sampling
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Under_Sampling, Clustering |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A prototype generation under-sampling technique that replaces majority class clusters with their centroids to reduce class imbalance.
Description
Cluster Centroid Under-Sampling is a prototype generation method that reduces the majority class by summarizing groups of similar majority samples into single representative points. Unlike prototype selection methods (such as Edited Nearest Neighbours or Condensed Nearest Neighbour) that choose a subset of existing samples to keep, this technique generates entirely new synthetic samples that represent the center of mass of each cluster.
The core idea is that if the majority class must be reduced to N samples, the majority samples can be grouped into N clusters and each cluster replaced by its centroid. This approximately preserves the overall spatial distribution of the majority class while drastically reducing its size, because each centroid captures the average position of the samples it represents.
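In symbols, and assuming a K-Means-style clustering, the replacement can be written as follows. The notation ($K$ for the target number of clusters, $C_k$ for the $k$-th cluster, $c_k$ for its centroid, $x_i$ for a majority sample) is introduced here for illustration only:

```latex
c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i, \qquad k = 1, \dots, K
```

For example, a cluster holding the points (1, 2), (3, 2), and (2, 5) would be replaced by the single point (2, 3).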
Usage
Use this principle when:
- The majority class significantly outnumbers the minority class and needs to be reduced
- You want to preserve the structural distribution of the majority class rather than randomly discarding samples
- Prototype generation (creating new representative samples) is preferred over prototype selection (choosing existing samples)
- The feature space is continuous and centroids are meaningful (e.g., not purely categorical data)
- You are willing to accept synthetic majority samples rather than retaining only original data points
Theoretical Basis
The algorithm applies K-Means clustering to the majority class to find representative points:
- Determine target count: Based on the sampling strategy, compute the desired number of majority samples N (typically matching the minority class count).
- Fit K-Means: Apply K-Means clustering with N clusters to the majority class samples only. Each cluster groups together similar majority samples.
- Extract centroids: The N cluster centroids become the new representative majority samples, each summarizing a region of the original majority class distribution.
- Reconstruct dataset: Combine the centroid-based majority samples with the unchanged minority class samples.
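The four steps above can be sketched end to end with scikit-learn's `KMeans` on a toy two-class dataset. This is a minimal illustration, not the library's implementation: the data, the variable names, and the choice of matching the minority count are assumptions made for the example, and the centroids are used directly as the new majority samples.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy data (assumed for illustration): 200 majority and 20 minority samples.
X_majority = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_minority = rng.normal(loc=3.0, scale=0.5, size=(20, 2))

# Step 1: target count, here chosen to match the minority class.
n_target = len(X_minority)

# Step 2: fit K-Means with n_target clusters on the majority class only.
kmeans = KMeans(n_clusters=n_target, n_init=10, random_state=0)
kmeans.fit(X_majority)

# Step 3: the centroids become the new majority samples.
X_majority_new = kmeans.cluster_centers_

# Step 4: recombine with the untouched minority samples and rebuild the labels.
X_resampled = np.vstack([X_majority_new, X_minority])
y_resampled = np.concatenate([
    np.zeros(n_target, dtype=int),
    np.ones(len(X_minority), dtype=int),
])

print(X_resampled.shape)         # (40, 2)
print(np.bincount(y_resampled))  # [20 20]
```

The majority class shrinks from 200 rows to 20 while the minority rows pass through unchanged.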
Voting strategies determine how centroids translate to final samples:
- "Soft" voting: Uses the centroid coordinates directly as synthetic samples. This produces points that may not correspond to any original sample but that minimize the sum of squared distances to the samples in their cluster. This is appropriate for dense data where intermediate coordinate values are meaningful.
- "Hard" voting: For each centroid, finds the nearest original majority sample using a nearest-neighbor search and uses that original sample instead. This ensures all resulting samples are actual data points from the original dataset, which is important for sparse data where synthetic intermediate values (e.g., fractional word counts) may not be valid.
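The difference between the two voting strategies can be seen on a single cluster. The sketch below is illustrative only: the five points are made up, the centroid is computed as a plain mean rather than by a clustering estimator, and the nearest original sample is found by brute-force Euclidean distance.

```python
import numpy as np

# One cluster of original majority samples with integer features (e.g., counts).
cluster = np.array([[0, 0], [2, 0], [0, 2], [2, 2], [1, 2]])

# Soft voting: keep the centroid itself, which need not be an original sample.
soft_sample = cluster.mean(axis=0)

# Hard voting: replace the centroid with the closest original sample.
distances = np.linalg.norm(cluster - soft_sample, axis=1)
hard_sample = cluster[np.argmin(distances)]

print(soft_sample)  # [1.  1.2]
print(hard_sample)  # [1 2]
```

The soft-voted sample contains a fractional value that never occurs in the integer-valued data, whereas the hard-voted sample is one of the original rows.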
Pseudo-code:
# Abstract Cluster Centroid Under-Sampling algorithm (NOT real implementation)
N = desired_majority_count  # typically equal to minority class count
# Fit clustering to the majority class only
kmeans = KMeans(n_clusters=N)
kmeans.fit(X_majority)
centroids = kmeans.cluster_centers_
if voting == "soft":
    X_majority_new = centroids  # use centroids directly
elif voting == "hard":
    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(X_majority)
    # kneighbors returns (distances, indices) by default; keep only the indices
    nearest_indices = nn.kneighbors(centroids, return_distance=False).ravel()
    X_majority_new = X_majority[nearest_indices]  # use nearest original samples
# Combine the reduced majority class with the unchanged minority class
X_resampled = concatenate([X_majority_new, X_minority])
Key properties:
- The method supports multi-class resampling by processing each class independently
- Any scikit-learn compatible clustering estimator that exposes an n_clusters parameter and a cluster_centers_ fitted attribute can be substituted for the default KMeans (e.g., MiniBatchKMeans for large datasets)
- Unlike random under-sampling, this method accounts for the spatial structure of the data, producing a more representative reduced majority set
- The method does not provide sample indices back to the original dataset (since soft voting creates synthetic samples), so the sample_indices tag is set to False