Principle:Scikit learn contrib Imbalanced learn One Sided Selection

Knowledge Sources	M. Kubat, S. Matwin, "Addressing the curse of imbalanced training sets: one-sided selection," ICML, vol. 97, pp. 179-186, 1997
Domains	Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

A two-phase under-sampling strategy that combines Condensed Nearest Neighbour for redundant sample removal with Tomek Links for noisy borderline sample removal, targeting both types of uninformative majority class samples.

Description

One-Sided Selection (OSS), introduced by Kubat and Matwin (1997), addresses the class imbalance problem by identifying and removing two distinct types of problematic majority class samples:

Redundant samples: Majority instances located far from the decision boundary that contribute nothing to the classifier's discriminative ability. These are removed by the Condensed Nearest Neighbour (CNN) phase.
Noisy/borderline samples: Majority instances that lie very close to minority samples, forming Tomek Links. These are removed by the Tomek Links cleaning phase.

By combining these two complementary techniques, OSS produces a cleaner and smaller majority class that retains only the informative samples near (but not too near) the decision boundary. This is a one-sided selection because only the majority class is reduced; the minority class remains untouched.

The technique supports multi-class problems through a one-vs.-rest decomposition, where each majority class is processed independently against the minority class.

Usage

Use this principle when the majority class suffers from both redundancy and borderline noise. One-Sided Selection is appropriate when:

Plain CNN retains too many noisy borderline majority samples
Tomek Links alone does not remove enough redundant interior majority samples
A two-phase cleaning strategy is acceptable in terms of computational cost
The goal is to improve classifier performance by removing both uninformative and misleading majority samples
The dataset is not extremely small (the CNN phase needs enough samples for meaningful 1-NN classification)

Theoretical Basis

One-Sided Selection operates in two sequential phases:

Phase 1 -- Condensed Nearest Neighbour (redundancy removal):

Initialize: Start with set S containing all minority class samples plus n_seeds_S randomly selected majority class samples
Classify: For each remaining majority sample, classify it using 1-NN on S
Add misclassified: If a majority sample is misclassified, add it to S
The result is a consistent subset where redundant interior majority samples have been removed

Phase 2 -- Tomek Links (noise removal):

Identify Tomek Links: A pair (a, b) forms a Tomek Link if a and b belong to different classes and each is the other's nearest neighbor
Remove majority partner: For each Tomek Link, remove the majority class sample
This eliminates noisy borderline majority samples that survived the CNN phase

Pseudo-code:

# Abstract OSS algorithm (NOT real implementation)

# Phase 1: CNN -- remove redundant majority samples
S = all_minority_samples + random_majority_seeds(n_seeds_S)
for x in remaining_majority_samples:
    nn = nearest_neighbor(x, S)
    if label(nn) != label(x):
        S.add(x)

# Phase 2: Tomek Links -- remove noisy borderline samples
for (a, b) in all_pairs_in_S:
    if is_tomek_link(a, b):
        if label(a) == majority_class:
            S.remove(a)
        elif label(b) == majority_class:
            S.remove(b)

# S now contains minority samples + clean, non-redundant majority samples

Key properties:

Complementary phases: CNN targets redundant samples (far from the boundary) while Tomek Links targets noisy samples (too close to the boundary)
Minority preservation: Only majority class samples are removed; the minority class remains intact
Order of operations matters: CNN runs first to reduce the dataset size, making Tomek Links computation faster on the condensed set
More aggressive than CNN alone: The Tomek Links phase provides additional cleaning that CNN does not perform, since CNN retains noisy samples by design

Related Pages

Implemented By

Implementation:Scikit_learn_contrib_Imbalanced_learn_OneSidedSelection

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment