Principle:Scikit learn contrib Imbalanced learn Combined Over Under Sampling
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A two-stage resampling strategy that first oversamples the minority class with SMOTE, then cleans the resulting data by removing noisy or ambiguous samples using an under-sampling technique.
Description
Combined over- and under-sampling addresses the noise introduced by SMOTE by applying a cleaning step after oversampling. Because SMOTE interpolates between neighboring minority samples without regard to where majority samples lie, it can generate synthetic samples that overlap with majority class instances, creating ambiguous regions. Following SMOTE with a cleaning method such as Edited Nearest Neighbours (ENN) or Tomek Links removes these noisy samples.
Two primary variants exist:
- SMOTE + ENN: Removes any sample (from either class) whose class label differs from that of most of its nearest neighbors. Because a whole neighborhood is inspected for every sample, this is the more aggressive cleaning approach.
- SMOTE + Tomek Links: Removes only Tomek links, that is, pairs of samples from different classes that are each other's nearest neighbor. This is a gentler cleaning approach that targets ambiguity right at the class boundary.
Usage
Use this principle when:
- SMOTE alone introduces too much noise near the decision boundary
- Cleaner class boundaries are needed after oversampling
- A balance between oversampling and noise reduction is desired
Choose SMOTE + ENN for aggressive cleaning and SMOTE + Tomek Links for conservative cleaning.
Theoretical Basis
Stage 1: Apply SMOTE to oversample the minority class.
Stage 2: Apply a cleaning rule:
- ENN cleaning: For each sample, find its k nearest neighbors. If the sample's class differs from the class held by most of those neighbors, remove it.
- Tomek Links: For each pair of samples from different classes that are each other's nearest neighbor, remove one or both to clean the boundary.
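# Illustrative two-stage resampling on toy data (a sketch, NOT the library's implementation)
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
# Stage 1: Oversample the minority class
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
# Stage 2: Clean
# ENN variant - remove samples misclassified by their k nearest neighbors
k = 3
# Query k + 1 neighbors: each sample is normally returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_over)
neighbors = nn.kneighbors(X_over, return_distance=False)[:, 1:]
# Majority vote among neighbor labels (assumes non-negative integer class labels)
votes = np.array([np.bincount(y_over[row]).argmax() for row in neighbors])
X_clean, y_clean = X_over[votes == y_over], y_over[votes == y_over]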
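The sketch above covers only the ENN variant. For comparison, the Tomek link rule can be illustrated in a similar spirit. This is a minimal brute-force version using NumPy, not the library's implementation; the helper name `find_tomek_links` and the toy points are hypothetical.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Two samples form a Tomek link when they belong to different classes
    and each one is the other's nearest neighbor.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances; brute force, fine for small data.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    nearest = dist.argmin(axis=1)

    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# Toy 1-D data: samples 2 and 3 sit on the class boundary.
X = np.array([[0.0], [1.0], [2.9], [3.1], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

links = find_tomek_links(X, y)
print(links)  # [(2, 3)]

# Removing both members of every link is one of the options described above.
drop = {idx for pair in links for idx in pair}
keep = np.array([i not in drop for i in range(len(y))])
X_clean, y_clean = X[keep], y[keep]
print(y_clean)  # [0 0 1 1]
```

Removing only the majority-class member of each link instead would keep sample 3 and drop sample 2, which corresponds to the "remove one" option.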