Principle:Scikit learn contrib Imbalanced learn Synthetic Minority Oversampling

Knowledge Sources	SMOTE: Synthetic Minority Over-sampling Technique; A Survey of Predictive Modelling under Imbalanced Distributions
Domains	Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

A synthetic data generation technique that creates new minority class samples by interpolating between existing minority class instances and their nearest neighbors.

Description

Synthetic Minority Oversampling Technique (SMOTE) addresses class imbalance by generating synthetic samples rather than duplicating existing minority instances. The core idea is to select a minority sample, find its k nearest neighbors within the minority class, and create new samples along the line segments connecting the sample to those neighbors. This reduces the overfitting risk associated with simple random oversampling, which merely replicates existing data points.

SMOTE was introduced by Chawla et al. (2002) and has since become one of the most widely adopted techniques for handling imbalanced datasets. The algorithm operates in feature space rather than data space, producing plausible new instances that expand the decision region of the minority class.

The technique supports multi-class problems through a one-vs-rest decomposition, where each minority class is oversampled independently against the remaining classes.

Usage

Use this principle when working with classification tasks where the minority class has significantly fewer samples than the majority class. SMOTE is appropriate when:

The minority class decision boundary needs to be expanded rather than simply reinforced
Random oversampling leads to overfitting on minority class instances
The feature space is continuous and numeric (for standard SMOTE; variants exist for categorical and mixed features)
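The minority class is not extremely small (each sample needs k other minority samples as neighbors, so SMOTE requires more than k minority samples to work at all, and considerably more to find meaningful neighbors)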

Theoretical Basis

The SMOTE algorithm generates synthetic samples through linear interpolation in feature space:

For each minority class sample x_i:

Find its k nearest minority class neighbors
Randomly select one neighbor x_nn
Generate a synthetic sample: $x_{\text{new}} = x_i + \lambda \cdot (x_{nn} - x_i)$, where $\lambda \sim U(0, 1)$
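
As a worked instance of the interpolation step, with values chosen purely for illustration, take $x_i = (1, 2)$, $x_{nn} = (3, 6)$, and a draw of $\lambda = 0.25$:

$$x_{\text{new}} = (1, 2) + 0.25 \cdot \big((3, 6) - (1, 2)\big) = (1, 2) + 0.25 \cdot (2, 4) = (1.5, 3)$$

The result lies a quarter of the way along the segment from $x_i$ to $x_{nn}$. A draw of $\lambda = 0$ would reproduce $x_i$, and draws close to 1 land near $x_{nn}$.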

Pseudo-code:

# Abstract SMOTE algorithm (NOT real implementation)
for each minority_sample x_i:
    neighbors = k_nearest_minority_neighbors(x_i, k=5)
    # num_synthetic_per_sample: how many new points to create from this x_i
    for j in range(num_synthetic_per_sample):
        x_nn = random_choice(neighbors)
        lam = random_uniform(0, 1)
        x_new = x_i + lam * (x_nn - x_i)
        add_to_dataset(x_new)
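
The pseudo-code can be turned into a small runnable sketch using only the Python standard library. This is an illustration under stated assumptions, not the imbalanced-learn implementation: the function name `smote_sketch` is hypothetical, neighbors are found by brute force with Euclidean distance, and the base sample for each new point is drawn at random (rather than looping over every sample) so that an arbitrary total can be requested.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority neighbors.

    Illustrative only: brute-force neighbor search, Euclidean distance,
    and a uniformly random choice of base sample for each new point.
    """
    if len(minority) <= k:
        raise ValueError("need more than k minority samples")
    rng = random.Random(seed)

    # Indices of the k nearest minority neighbors of each sample (self excluded).
    neighbors = []
    for i, x in enumerate(minority):
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(x, minority[j]))
        neighbors.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))  # base sample x_i
        j = rng.choice(neighbors[i])      # one of its k neighbors, x_nn
        lam = rng.random()                # lambda drawn from [0, 1)
        x_i, x_nn = minority[i], minority[j]
        synthetic.append([a + lam * (b - a) for a, b in zip(x_i, x_nn)])
    return synthetic


# Six minority points inside the unit square (made-up data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
new_points = smote_sketch(minority, n_synthetic=10, k=3)
```

Because every new point lies on a segment between two existing minority points, all generated coordinates stay within the range spanned by the original minority samples.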

The total number of synthetic samples generated for each minority class is determined by the sampling_strategy parameter, which defines the desired class distribution after resampling; that total is then spread across the minority instances of the class.
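
To make the counting concrete, the sketch below computes how many synthetic samples each class would need. It assumes the target distribution is "raise every class to the size of the largest class", which is how imbalanced-learn documents its default `sampling_strategy="auto"` for over-samplers; the helper name `synthetic_counts` and the labels are hypothetical.

```python
from collections import Counter


def synthetic_counts(y):
    """Number of synthetic samples each class needs to match the largest class."""
    counts = Counter(y)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items() if n < target}


# Three-class example: each minority class is handled independently.
y = ["a"] * 100 + ["b"] * 30 + ["c"] * 10
needed = synthetic_counts(y)
print(needed)  # {'b': 70, 'c': 90}
```

Each minority class gets its own count, which matches the multi-class treatment in which every minority class is oversampled independently.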

Related Pages

Implemented By

Implementation:Scikit_learn_contrib_Imbalanced_learn_SMOTE
