Principle:Scikit learn contrib Imbalanced learn Synthetic Minority Oversampling
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A synthetic data generation technique that creates new minority class samples by interpolating between existing minority class instances and their nearest neighbors.
Description
Synthetic Minority Oversampling Technique (SMOTE) addresses class imbalance by generating synthetic samples rather than duplicating existing minority instances. The core idea is to select a minority sample, find its k nearest neighbors within the minority class, and create new samples along the line segments connecting it to those neighbors. Because the new points are not exact copies, this approach reduces the overfitting risk associated with simple random oversampling, which merely replicates existing data points.
SMOTE was introduced by Chawla et al. (2002) and has since become one of the most widely adopted techniques for handling imbalanced datasets. The algorithm operates in feature space rather than data space, producing plausible new instances that expand the decision region of the minority class.
The technique supports multi-class problems through a one-vs-rest decomposition, where each minority class is oversampled independently against the remaining classes.
Usage
Use this principle when working with classification tasks where the minority class has significantly fewer samples than the majority class. SMOTE is appropriate when:
- The minority class decision boundary needs to be expanded rather than simply reinforced
- Random oversampling leads to overfitting on minority class instances
- The feature space is continuous and numeric (for standard SMOTE; variants exist for categorical and mixed features)
- The dataset is not extremely small (each sample needs k other minority samples as neighbors, so the minority class must contain more than k instances)
Theoretical Basis
The SMOTE algorithm generates synthetic samples through linear interpolation in feature space:
For each minority class sample x_i:
- Find its k nearest minority class neighbors
- Randomly select one neighbor x_nn
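- Generate a synthetic sample: $x_{\text{new}} = x_i + \lambda \, (x_{nn} - x_i)$, where $\lambda$ is drawn uniformly at random from $[0, 1]$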
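As a worked instance of the interpolation step, assume the illustrative values $x_i = (1, 2)$, $x_{nn} = (3, 6)$ and a drawn $\lambda = 0.25$ (these numbers are made up for the example and do not come from any dataset):

```latex
\begin{aligned}
x_{\text{new}} &= x_i + \lambda \, (x_{nn} - x_i) \\
               &= (1, 2) + 0.25 \, \bigl((3, 6) - (1, 2)\bigr) \\
               &= (1, 2) + (0.5, 1.0) \\
               &= (1.5, 3.0)
\end{aligned}
```

The result lies a quarter of the way along the segment from $x_i$ to $x_{nn}$. Setting $\lambda = 0$ reproduces $x_i$ and $\lambda = 1$ reproduces $x_{nn}$, so every synthetic point stays on that segment.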
Pseudo-code:

    # Abstract SMOTE algorithm (NOT real implementation)
    for each minority_sample x_i:
        neighbors = k_nearest_minority_neighbors(x_i, k=5)
        for j in range(num_synthetic_needed):
            x_nn = random_choice(neighbors)
            lam = random_uniform(0, 1)
            x_new = x_i + lam * (x_nn - x_i)
            add_to_dataset(x_new)
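
The abstract steps above can be made runnable with the standard library alone. The following is a sketch under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters `n_per_sample` and `seed`, and the brute-force Euclidean neighbor search are all hypothetical choices made for illustration.

```python
import math
import random


def smote_sketch(minority, n_per_sample, k=5, seed=0):
    """Return synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature lists from ONE class.
    n_per_sample: how many synthetic points to create per minority sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    # A sample cannot have more neighbors than there are other samples.
    k = min(k, len(minority) - 1)

    synthetic = []
    for i, x_i in enumerate(minority):
        # Brute-force k nearest minority neighbors of x_i (excluding itself).
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x_i, minority[j]))
        neighbors = others[:k]

        for _ in range(n_per_sample):
            x_nn = minority[rng.choice(neighbors)]  # random neighbor
            lam = rng.random()                      # lam in [0, 1)
            # Linear interpolation: x_new = x_i + lam * (x_nn - x_i)
            synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic


# Hypothetical 2-D minority class with four samples.
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.0, 3.0]]
new_points = smote_sketch(minority, n_per_sample=2, k=2)
print(len(new_points))
```

The brute-force search costs quadratic time in the number of minority samples, which is acceptable for a demonstration but would usually be replaced by a nearest-neighbor index on larger data.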
The total number of synthetic samples generated for each class is determined by the sampling_strategy parameter, which defines the desired class distribution after resampling. That total is then spread across the minority instances of the class.
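
To make the link between a target class distribution and the number of synthetic samples concrete, here is a small sketch. The helper `synthetic_counts` and its `targets` argument are hypothetical names for illustration; it assumes that, by default, every non-majority class is raised to the majority count, and that an explicit mapping gives the desired final count per class.

```python
from collections import Counter


def synthetic_counts(y, targets=None):
    """Return {class_label: number_of_synthetic_samples_needed}.

    targets: optional {label: desired_count_after_resampling}. When omitted,
    every class is raised to the size of the largest class.
    """
    counts = Counter(y)
    if targets is None:
        n_majority = max(counts.values())
        targets = {label: n_majority for label in counts}

    needed = {}
    for label, target in targets.items():
        deficit = target - counts[label]
        if deficit < 0:
            raise ValueError(f"target for {label!r} is below its current count")
        if deficit > 0:
            needed[label] = deficit  # each class is handled independently
    return needed


# Hypothetical three-class label vector: 90 / 8 / 2 samples.
y = ["a"] * 90 + ["b"] * 8 + ["c"] * 2
print(synthetic_counts(y))             # {'b': 82, 'c': 88}
print(synthetic_counts(y, {"b": 20}))  # {'b': 12}
```

Each class gets its own deficit, which mirrors the one-vs-rest treatment of multi-class problems: classes "b" and "c" are oversampled separately, each against the rest of the data.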