Implementation: scikit-learn-contrib / imbalanced-learn SMOTE
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Concrete tool for generating synthetic minority class samples via nearest-neighbor interpolation provided by the imbalanced-learn library.
Description
The SMOTE class implements the Synthetic Minority Over-sampling Technique. It extends BaseSMOTE and generates new minority samples by interpolating between each minority instance and its k-nearest neighbors. The class integrates with scikit-learn's estimator API, supporting pipeline composition, parameter validation, and metadata routing.
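The interpolation step described above can be sketched in a few lines of plain Python. This is a simplified illustration of the idea, not the library's actual implementation; the helper name `smote_interpolate` is invented for this sketch.

```python
import random

random.seed(0)

def smote_interpolate(x_i, x_nn):
    # A synthetic sample lies at a random point on the line segment
    # joining a minority sample x_i and one of its k nearest minority
    # neighbors x_nn. A single gap factor lam is drawn per pair and
    # applied to every feature, so the new point stays on the segment.
    lam = random.random()
    return [a + lam * (b - a) for a, b in zip(x_i, x_nn)]

x_new = smote_interpolate([1.0, 2.0], [3.0, 4.0])
# each coordinate of x_new lies between the corresponding parent coordinates
```

Because the same gap factor is used for all features, every synthetic point falls on a segment between two real minority samples, which is why SMOTE assumes continuous numeric features.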
Usage
Import this class when you need to balance a dataset with continuous numeric features before training a classifier. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline.
Code Reference
Source Location
- Repository: imbalanced-learn
- File: imblearn/over_sampling/_smote/base.py
- Lines: L242-380
Signature
class SMOTE(BaseSMOTE):
def __init__(
self,
*,
sampling_strategy="auto",
random_state=None,
k_neighbors=5,
):
"""
Args:
        sampling_strategy: float, str, dict, or callable - Sampling target;
            'auto' (the default) oversamples every class except the majority
            up to the majority count.
random_state: int, RandomState, or None - Seed for reproducibility.
k_neighbors: int or NearestNeighbors instance - Number of nearest
neighbors used to generate synthetic samples (default: 5).
"""
Import
from imblearn.over_sampling import SMOTE
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership |
| sampling_strategy | float, str, dict, or callable | No | Resampling target; 'auto' oversamples all classes but the majority up to the majority count |
| k_neighbors | int or NearestNeighbors | No | Nearest neighbors for interpolation (default: 5) |
| random_state | int, RandomState, or None | No | Random seed for reproducibility |
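The `'auto'` strategy (equivalent to `'not majority'` for over-samplers) can be made concrete with a small sketch of how per-class synthetic-sample targets are derived from label counts. The helper `auto_targets` is illustrative, not part of the imbalanced-learn API.

```python
from collections import Counter

def auto_targets(y):
    # 'auto' for an over-sampler brings every non-majority class up to
    # the majority class count; the returned dict gives the number of
    # synthetic samples to generate per class (simplified sketch).
    counts = Counter(y)
    majority = max(counts.values())
    return {cls: majority - n for cls, n in counts.items() if n < majority}

y = [0] * 10 + [1] * 90
# class 0 needs 80 synthetic samples to match the 90 majority samples
print(auto_targets(y))  # {0: 80}
```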
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with synthetic minority samples added |
| y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels for synthetic samples |
Usage Examples
Basic Oversampling
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
# 1. Create an imbalanced dataset
X, y = make_classification(
n_classes=2, class_sep=2, weights=[0.1, 0.9],
n_informative=3, n_redundant=1, flip_y=0,
n_features=20, n_clusters_per_class=1,
n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# 2. Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")
Inside a Pipeline
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
# Build pipeline with SMOTE + classifier
pipeline = make_pipeline(SMOTE(random_state=42), LinearSVC())
# Cross-validate (SMOTE applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")
Custom Sampling Strategy
from imblearn.over_sampling import SMOTE
# Specify exact number of samples per class
smote = SMOTE(
    sampling_strategy={0: 500},  # Class 0 will contain 500 samples after resampling
k_neighbors=3,
random_state=42,
)
X_res, y_res = smote.fit_resample(X, y)  # X, y from the basic example above
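For binary problems, `sampling_strategy` also accepts a float giving the desired ratio of minority to majority samples after resampling. A quick sketch of how such a ratio maps to a target count; the helper `minority_target` is invented for illustration and is not part of the library.

```python
from collections import Counter
import math

def minority_target(y, ratio):
    # A float sampling_strategy means the desired n_minority / n_majority
    # ratio after resampling (binary classification only, per the
    # imbalanced-learn API); this sketch derives the implied target count.
    counts = Counter(y)
    (_, n_maj), (min_cls, _) = counts.most_common()
    return {min_cls: math.ceil(ratio * n_maj)}

y = [1] * 90 + [0] * 10
print(minority_target(y, 0.5))  # {0: 45}
```

So `SMOTE(sampling_strategy=0.5)` on this dataset would grow class 0 from 10 to 45 samples; a dict like `{0: 500}` instead fixes the post-resampling count directly.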