

Implementation:Scikit learn contrib Imbalanced learn OneSidedSelection

From Leeroopedia


Knowledge Sources
Domains: Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated: 2026-02-09 03:00 GMT

Overview

Under-sampling technique that combines Condensed Nearest Neighbour with Tomek Links removal to eliminate both redundant and noisy majority class samples in a two-phase cleaning process.

Description

The OneSidedSelection class implements the one-sided selection method for under-sampling majority class instances. It extends BaseCleaningSampler and operates in two phases. First, it applies the Condensed Nearest Neighbour (CNN) rule: starting from the minority class plus a few majority seeds, it retains only the majority samples needed for correct 1-NN classification, which discards redundant interior samples. Second, it applies Tomek Links removal to the CNN output to clean noisy borderline majority samples. The class integrates with scikit-learn's estimator API, supporting pipeline composition, parameter validation, and multi-class resampling via a one-vs.-one scheme in which each targeted class is processed against the minority class.
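The two phases described above can be sketched in plain NumPy for a binary problem. This is an illustrative approximation, not the library's code: the function name `one_sided_selection_sketch`, the single-pass CNN step, the single random seed, and the brute-force Euclidean distances are all assumptions made for brevity.

```python
import numpy as np


def one_sided_selection_sketch(X, y, minority_label, majority_label, seed=0):
    """Hypothetical two-phase OSS for a binary problem; returns kept indices."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == minority_label)
    idx_maj = np.flatnonzero(y == majority_label)

    # Phase 1 (CNN-style, single pass, an assumption of this sketch):
    # the store starts as all minority samples plus one random majority seed.
    # Every remaining majority sample that 1-NN on the store misclassifies
    # is added to the store; correctly classified ones are treated as redundant.
    seed_idx = rng.choice(idx_maj, size=1)
    store = np.concatenate([idx_min, seed_idx])
    rest = np.setdiff1d(idx_maj, seed_idx)
    dist = np.linalg.norm(X[rest][:, None, :] - X[store][None, :, :], axis=2)
    pred = y[store][dist.argmin(axis=1)]
    store = np.concatenate([store, rest[pred != majority_label]])

    # Phase 2 (Tomek links): inside the store, two samples of opposite classes
    # that are each other's nearest neighbour form a link. The majority member
    # of each link is dropped.
    Xs, ys = X[store], y[store]
    dist = np.linalg.norm(Xs[:, None, :] - Xs[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)
    mutual = nn[nn] == np.arange(len(store))
    drop = mutual & (ys != ys[nn]) & (ys == majority_label)
    return np.sort(store[~drop])


# Tiny 1-D dataset: two minority points near 0, a majority cluster near 5,
# and one noisy majority point (index 5) sitting between the minority points.
X_demo = np.array([[0.0], [0.3], [5.0], [5.1], [5.2], [0.1]])
y_demo = np.array([1, 1, 0, 0, 0, 0])
kept = one_sided_selection_sketch(X_demo, y_demo, minority_label=1, majority_label=0)
print(kept)
```

Which majority samples survive phase 1 depends on the random seed, but the noisy point at index 5 forms a Tomek link with the minority point at index 0 and is removed in phase 2.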

Usage

Import this class when you need a more thorough majority class reduction than CNN alone provides. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. One-Sided Selection is preferred over plain CNN when borderline noise is a concern, since the Tomek Links phase removes majority samples that form Tomek links with samples of another class, which CNN alone would retain.

Code Reference

Source Location

  • Repository: imbalanced-learn
  • File: imblearn/under_sampling/_prototype_selection/_one_sided_selection.py
  • Lines: L1-213

Signature

class OneSidedSelection(BaseCleaningSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        random_state=None,
        n_neighbors=None,
        n_seeds_S=1,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, list, or callable - Which classes to
                clean. 'auto' targets all classes except the minority
                class; a list names the classes to target.
            random_state: int, RandomState, or None - Seed for reproducibility
                when selecting the initial majority seed samples.
            n_neighbors: int, KNeighborsClassifier, or None - Number of
                nearest neighbors for classification. None defaults to 1-NN.
            n_seeds_S: int - Number of initial majority samples to seed the
                condensed set (default: 1).
            n_jobs: int or None - Number of parallel jobs for the nearest
                neighbor classifier.
        """

Import

from imblearn.under_sampling import OneSidedSelection

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership |
| sampling_strategy | str, list, or callable | No | Classes to clean; 'auto' targets all classes except the minority |
| n_neighbors | int, KNeighborsClassifier, or None | No | Neighbor count or estimator for the nearest-neighbor classifier used in the CNN phase (default: None, i.e. 1-NN) |
| n_seeds_S | int | No | Number of random majority seeds to initialize the condensed set (default: 1) |
| random_state | int, RandomState, or None | No | Random seed for reproducibility |
| n_jobs | int or None | No | Number of parallel jobs for the nearest neighbor search |

Outputs

| Name | Type | Description |
| --- | --- | --- |
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with redundant and noisy majority samples removed |
| y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels for the cleaned subset |

Attributes

| Name | Type | Description |
| --- | --- | --- |
| sampling_strategy_ | dict | Keys are the class labels targeted for cleaning; values are per-class sample counts computed during validation |
| estimators_ | list of KNeighborsClassifier | One fitted K-nearest-neighbors estimator per resampled class (from the CNN phase) |
| sample_indices_ | ndarray of shape (n_new_samples,) | Indices of selected samples from the original dataset (after both CNN and Tomek Links phases) |
| n_features_in_ | int | Number of features seen during fit |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Feature names seen during fit (when X has string feature names) |

Usage Examples

Basic Under-sampling

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")

# 2. Apply One-Sided Selection
oss = OneSidedSelection(random_state=42)
X_resampled, y_resampled = oss.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")

Inside a Pipeline

from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import OneSidedSelection
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate

# Build pipeline with OSS + classifier (X, y come from the basic example above)
pipeline = make_pipeline(
    OneSidedSelection(random_state=42),
    LinearSVC(),
)

# Cross-validate (OSS applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")

Custom Neighbor Count

from imblearn.under_sampling import OneSidedSelection

# Use 3-NN instead of the default 1-NN for the CNN phase
# (X, y come from the basic example above)
oss = OneSidedSelection(
    n_neighbors=3,
    n_seeds_S=1,
    random_state=42,
)
X_res, y_res = oss.fit_resample(X, y)

# Inspect which samples were retained after both phases
print(f"Retained indices: {oss.sample_indices_}")

