Implementation:Scikit learn contrib Imbalanced learn OneSidedSelection

Knowledge Sources	imbalanced-learn; imbalanced-learn Docs; M. Kubat, S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," ICML, 1997
Domains	Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

Under-sampling technique that combines Condensed Nearest Neighbour with Tomek Links removal to eliminate both redundant and noisy majority class samples in a two-phase cleaning process.

Description

The OneSidedSelection class implements the one-sided selection method for under-sampling majority class instances. It extends BaseCleaningSampler and operates in two phases. First, it applies the Condensed Nearest Neighbour (CNN) rule: starting from a condensed set seeded with n_seeds_S randomly chosen majority samples, it retains only the majority samples needed for correct 1-NN classification and discards redundant interior samples. Second, it applies Tomek Links removal to the CNN output to clean noisy borderline samples. The class follows scikit-learn's estimator API, so it supports pipeline composition and parameter validation, and it handles multi-class resampling via a one-vs.-rest scheme.
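The two phases can be illustrated with a small NumPy-only sketch. This is a simplified, binary-only illustration and not the library's implementation: the function names oss_sketch and nearest_index are hypothetical, the Euclidean metric and the single-pass CNN step are assumptions, and only the majority member of each Tomek link is dropped.

```python
import numpy as np


def nearest_index(queries, references):
    """Index of the closest reference row for each query row (Euclidean)."""
    d = np.linalg.norm(queries[:, None, :] - references[None, :, :], axis=2)
    return d.argmin(axis=1)


def oss_sketch(X, y, n_seeds=1, seed=0):
    """Return the indices kept by a simplified, binary-only one-sided selection."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[counts.argmin()]
    idx_min = np.flatnonzero(y == minority)
    idx_maj = np.flatnonzero(y != minority)

    # Phase 1 (CNN-style): condensed set C = all minority + a few majority seeds.
    seeds = rng.choice(idx_maj, size=n_seeds, replace=False)
    C = np.concatenate([idx_min, seeds])
    rest = np.setdiff1d(idx_maj, seeds)
    # Keep only the remaining majority samples that 1-NN on C misclassifies.
    pred = y[C][nearest_index(X[rest], X[C])]
    kept = np.concatenate([C, rest[pred != y[rest]]])

    # Phase 2 (Tomek links): mutual nearest neighbours with different labels.
    Xk, yk = X[kept], y[kept]
    d = np.linalg.norm(Xk[:, None, :] - Xk[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    mutual = nn[nn] == np.arange(len(kept))
    in_link = mutual & (yk != yk[nn])
    # Drop only the majority member of each link.
    drop = in_link & (yk != minority)
    return np.sort(kept[~drop])


# Toy data: 20 majority points on a line, 3 minority points far away.
X = np.array([[0.1 * i, 0.0] for i in range(20)]
             + [[10.0, 0.0], [10.1, 0.0], [10.2, 0.0]])
y = np.array([0] * 20 + [1] * 3)
kept = oss_sketch(X, y)
print(len(kept), "samples kept;", int((y[kept] == 1).sum()), "minority")
```

On this toy data every non-seed majority point is classified correctly by the seed, so phase 1 keeps only the seed plus the three minority points, and phase 2 finds no cross-class Tomek link to remove.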

Usage

Import this class when you need a more thorough majority class reduction than CNN alone provides. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. One-Sided Selection is preferred over plain CNN when borderline noise is a concern, since the Tomek Links phase removes noisy pairs that CNN would retain.

Code Reference

Source Location

Repository: imbalanced-learn
File: imblearn/under_sampling/_prototype_selection/_one_sided_selection.py
Lines: L1-213

Signature

class OneSidedSelection(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Classes targeted
                by the resampling. 'auto' targets all classes except
                the minority class.
            random_state: int, RandomState, or None - Seed for reproducibility
                when selecting the initial majority seed samples.
            n_neighbors: int, KNeighborsClassifier, or None - Number of
                nearest neighbors for classification. None defaults to 1-NN.
            n_seeds_S: int - Number of initial majority samples to seed the
                condensed set (default: 1).
            n_jobs: int or None - Number of parallel jobs for the nearest
                neighbor classifier.
        """

Import

from imblearn.under_sampling import OneSidedSelection

I/O Contract

Inputs

Name	Type	Required	Description
X	{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)	Yes	Feature matrix of training data
y	array-like of shape (n_samples,)	Yes	Target labels indicating class membership
sampling_strategy	str, list, or callable	No	Classes targeted by the resampling; 'auto' targets all classes except the minority
n_neighbors	int, KNeighborsClassifier, or None	No	Neighbor count or estimator used for nearest-neighbor classification in the CNN phase (default: None, i.e. 1-NN)
n_seeds_S	int	No	Number of random majority seeds to initialize the condensed set (default: 1)
random_state	int, RandomState, or None	No	Random seed for reproducibility
n_jobs	int or None	No	Number of parallel jobs for the nearest neighbor search

Outputs

Name	Type	Description
X_resampled	{ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features)	Feature matrix with redundant and noisy majority samples removed
y_resampled	ndarray of shape (n_samples_new,)	Target array with corresponding labels for the cleaned subset

Attributes

Name	Type	Description
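sampling_strategy_	dict	Keys are the class labels targeted by the resampling; the number of samples actually removed is determined by the CNN and Tomek Links phases, not by the dictionary values
estimators_	list of KNeighborsClassifier	One fitted nearest-neighbor classifier per resampled class (from the CNN phase)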
sample_indices_	ndarray of shape (n_new_samples,)	Indices of selected samples from the original dataset (after both CNN and Tomek Links phases)
n_features_in_	int	Number of features seen during fit
feature_names_in_	ndarray of shape (n_features_in_,)	Feature names seen during fit (when X has string feature names)

Usage Examples

Basic Under-sampling

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")

# 2. Apply One-Sided Selection
oss = OneSidedSelection(random_state=42)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")

Inside a Pipeline

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import OneSidedSelection
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

# Build pipeline with OSS + classifier
pipeline = make_pipeline(
    OneSidedSelection(random_state=42),
    LinearSVC(),
)

# Cross-validate on the X, y from the basic example (OSS is applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")

Custom Neighbor Count

from imblearn.under_sampling import OneSidedSelection

# Use 3-NN instead of the default 1-NN for the CNN phase
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42,
)
# X, y are the arrays created in the basic example
X_res, y_res = oss.fit_resample(X, y)

# Inspect which samples were retained after both phases
print(f"Retained indices: {oss.sample_indices_}")

Related Pages

Implements Principle

Principle:Scikit_learn_contrib_Imbalanced_learn_One_Sided_Selection

Requires Environment

Environment:Scikit_learn_contrib_Imbalanced_learn_Python_Scikit_learn
