Implementation:Scikit learn contrib Imbalanced learn OneSidedSelection
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Under-sampling technique that combines Condensed Nearest Neighbour with Tomek Links removal to eliminate both redundant and noisy majority class samples in a two-phase cleaning process.
Description
The OneSidedSelection class implements the one-sided selection method for under-sampling majority class instances. It extends BaseCleaningSampler and operates in two phases: first, it applies the Condensed Nearest Neighbour (CNN) rule to identify and retain only those majority samples necessary for correct 1-NN classification (removing redundant interior samples); second, it applies Tomek Links removal to clean noisy borderline samples from the CNN output. This two-phase approach targets both redundant and noisy majority samples. The class integrates with scikit-learn's estimator API, supporting pipeline composition, parameter validation, and multi-class resampling via a one-vs.-rest scheme.
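The two phases described above can be sketched without any dependencies. This is a minimal illustration for a binary problem, not the library's implementation: the function name `one_sided_selection_sketch` and its `seed_index` argument are hypothetical, and the seed is passed in explicitly here, whereas the class draws `n_seeds_S` seeds at random according to `random_state`.

```python
import math


def nearest(point, candidates):
    """Return the position in `candidates` of the point closest to `point`."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))


def one_sided_selection_sketch(X, y, majority, seed_index):
    """Return sorted indices kept after a CNN-style pass and Tomek link removal."""
    maj_idx = [i for i, label in enumerate(y) if label == majority]
    min_idx = [i for i, label in enumerate(y) if label != majority]

    # Phase 1 (CNN-style): C = every minority sample plus one majority seed.
    C = min_idx + [seed_index]
    C_points = [X[i] for i in C]
    kept = list(C)
    for i in maj_idx:
        if i == seed_index:
            continue
        # Keep a majority sample only if 1-NN fitted on C misclassifies it.
        if y[C[nearest(X[i], C_points)]] != majority:
            kept.append(i)

    # Phase 2 (Tomek links): two samples of different classes that are each
    # other's nearest neighbour form a link; drop the majority member.
    def nn_in_kept(i):
        others = [j for j in kept if j != i]
        return others[nearest(X[i], [X[j] for j in others])]

    linked_majority = set()
    for i in kept:
        j = nn_in_kept(i)
        if y[i] != y[j] and nn_in_kept(j) == i:
            linked_majority.add(i if y[i] == majority else j)
    return sorted(i for i in kept if i not in linked_majority)


# Two minority points, one noisy majority point next to them (index 2),
# and a distant cluster of four majority points (indices 3 to 6).
X = [(0, 0), (1, 0), (0.4, 0), (10, 0), (10, 1), (11, 0), (11, 1)]
y = [1, 1, 0, 0, 0, 0, 0]

print(one_sided_selection_sketch(X, y, majority=0, seed_index=3))  # → [0, 1, 3]
print(one_sided_selection_sketch(X, y, majority=0, seed_index=2))  # → [0, 1, 3, 4, 5, 6]
```

With the seed inside the distant cluster, the other cluster members are classified correctly and dropped as redundant, and the noisy point is removed as the majority half of a Tomek link. With the noisy point as the seed, the cluster is misclassified and therefore kept, which shows how much the result can depend on the seed choice.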
Usage
Import this class when you need a more thorough majority class reduction than CNN alone provides. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. One-Sided Selection is preferred over plain CNN when borderline noise is a concern, since the Tomek Links phase removes noisy pairs that CNN would retain.
Code Reference
Source Location
- Repository: imbalanced-learn
- File: imblearn/under_sampling/_prototype_selection/_one_sided_selection.py
- Lines: L1-213
Signature
class OneSidedSelection(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Classes targeted by
                the resampling. 'auto' targets all classes except
                the minority class.
            random_state: int, RandomState, or None - Seed for reproducibility
                when selecting the initial majority seed samples.
            n_neighbors: int, KNeighborsClassifier, or None - Number of
                nearest neighbors for classification. None defaults to 1-NN.
            n_seeds_S: int - Number of initial majority samples to seed the
                condensed set (default: 1).
            n_jobs: int or None - Number of parallel jobs for the nearest
                neighbor classifier.
        """
Import
from imblearn.under_sampling import OneSidedSelection
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership |
| sampling_strategy | str, list, or callable | No | Classes targeted by the resampling; 'auto' targets all classes except the minority |
| n_neighbors | int, KNeighborsClassifier, or None | No | Neighbor count, or a nearest-neighbors estimator, used for classification in the CNN phase (default: None, i.e. 1-NN) |
| n_seeds_S | int | No | Number of random majority seeds to initialize the condensed set (default: 1) |
| random_state | int, RandomState, or None | No | Random seed for reproducibility |
| n_jobs | int or None | No | Number of parallel jobs for the nearest neighbor search |
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with redundant and noisy majority samples removed |
| y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels for the cleaned subset |
Attributes
| Name | Type | Description |
|---|---|---|
| sampling_strategy_ | dict | Keys are the class labels targeted by the resampling; values are nominal sample counts, which a cleaning sampler does not enforce |
| estimators_ | list of KNeighborsClassifier | One fitted 1-NN estimator per resampled class (from the CNN phase) |
| sample_indices_ | ndarray of shape (n_new_samples,) | Indices of selected samples from the original dataset (after both CNN and Tomek Links phases) |
| n_features_in_ | int | Number of features seen during fit |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Feature names seen during fit (when X has string feature names) |
Usage Examples
Basic Under-sampling
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection
# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# 2. Apply One-Sided Selection
oss = OneSidedSelection(random_state=42)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")
Inside a Pipeline
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import OneSidedSelection
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
# Build pipeline with OSS + classifier
# (X, y are the arrays created in the basic example above)
pipeline = make_pipeline(
    OneSidedSelection(random_state=42),
    LinearSVC(),
)
# Cross-validate (OSS applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")
Custom Neighbor Count
from imblearn.under_sampling import OneSidedSelection
# Use 3-NN instead of the default 1-NN for the CNN phase
# (X, y are the arrays created in the basic example above)
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42,
)
X_res, y_res = oss.fit_resample(X, y)
# Inspect which samples were retained after both phases
print(f"Retained indices: {oss.sample_indices_}")