Principle:Scikit learn contrib Imbalanced learn Combined Over Under Sampling Tomek
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A conservative two-stage resampling strategy that oversamples with SMOTE, then removes Tomek link pairs so that only the most ambiguous boundary samples are cleaned.
Description
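SMOTE + Tomek Links first applies SMOTE to balance the dataset, then searches the oversampled data for Tomek links: pairs of samples from different classes that are each other's nearest neighbors. Removing these pairs eliminates the most ambiguous boundary cases. Compared to SMOTE+ENN, this approach is more conservative: ENN removes any sample whose class disagrees with the majority of its k nearest neighbors, whereas Tomek link removal only touches mutual nearest-neighbor pairs, so fewer samples are removed and more of the oversampled data is preserved.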
Usage
Use this principle when a gentle cleaning step after SMOTE is preferred. Tomek link removal targets only mutual nearest-neighbor conflicts between classes, making it suitable when aggressive data removal is undesirable.
Theoretical Basis
A Tomek link exists between samples a and b, with d denoting the distance between two samples, if both of the following hold:
- They belong to different classes
- There is no sample c such that d(a,c) < d(a,b) or d(b,c) < d(a,b)
In other words, a and b are each other's nearest neighbor despite being from different classes.
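The definition above can be checked directly with a brute-force search. The following is a minimal sketch, not the library's implementation: the function name `tomek_links` and the toy data are hypothetical, Euclidean distance is assumed for d, and the data is assumed to contain no duplicate or equidistant points, so that every sample has a unique nearest neighbor.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links.

    Brute-force O(n^2) sketch using Euclidean distance. Assumes no
    ties, so each sample has exactly one nearest neighbor.
    """
    def nearest(i):
        # Index of the closest *other* sample to sample i.
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    # A Tomek link is a mutual nearest-neighbor pair with different labels.
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy 1-D data: the two classes meet between x=2.0 and x=2.4.
X = [(0.0,), (1.2,), (2.0,), (2.4,), (4.0,), (5.0,)]
y = [0, 0, 0, 1, 1, 1]

links = tomek_links(X, y)
print(links)  # [(2, 3)]
```

Samples 4 and 5 are also mutual nearest neighbors, but they share a class, so they are not a Tomek link. Only the cross-class pair (2, 3) qualifies.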
# Abstract SMOTE+Tomek algorithm (NOT the real implementation)
X_over, y_over = SMOTE().fit_resample(X, y)
for each pair (a, b) in X_over where class(a) != class(b):
    if nearest_neighbor(a) == b and nearest_neighbor(b) == a:
        remove(a, b)  # Tomek link pair: drop both samples