Principle:Scikit learn contrib Imbalanced learn One Sided Selection
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A two-phase under-sampling strategy that combines Condensed Nearest Neighbour for redundant sample removal with Tomek Links for noisy borderline sample removal, targeting both types of uninformative majority class samples.
Description
One-Sided Selection (OSS), introduced by Kubat and Matwin (1997), addresses the class imbalance problem by identifying and removing two distinct types of problematic majority class samples:
- Redundant samples: Majority instances located far from the decision boundary that contribute nothing to the classifier's discriminative ability. These are removed by the Condensed Nearest Neighbour (CNN) phase.
- Noisy/borderline samples: Majority instances that lie very close to minority samples, forming Tomek Links. These are removed by the Tomek Links cleaning phase.
By combining these two complementary techniques, OSS produces a cleaner and smaller majority class that retains only the informative samples near (but not too near) the decision boundary. This is a one-sided selection because only the majority class is reduced; the minority class remains untouched.
The technique supports multi-class problems through a one-vs.-rest decomposition, where each majority class is processed independently against the minority class.
Usage
Use this principle when the majority class suffers from both redundancy and borderline noise. One-Sided Selection is appropriate when:
- Plain CNN retains too many noisy borderline majority samples
- Tomek Links alone does not remove enough redundant interior majority samples
- A two-phase cleaning strategy is acceptable in terms of computational cost
- The goal is to improve classifier performance by removing both uninformative and misleading majority samples
- The dataset is not extremely small (the CNN phase needs enough samples for meaningful 1-NN classification)
Theoretical Basis
One-Sided Selection operates in two sequential phases:
Phase 1 -- Condensed Nearest Neighbour (redundancy removal):
- Initialize: Start with set S containing all minority class samples plus n_seeds_S randomly selected majority class samples
- Classify: For each remaining majority sample, classify it using 1-NN on S
- Add misclassified: If a majority sample is misclassified, add it to S
- The result is a consistent subset where redundant interior majority samples have been removed
Phase 2 -- Tomek Links (noise removal):
- Identify Tomek Links: A pair (a, b) forms a Tomek Link if a and b belong to different classes and each is the other's nearest neighbor
- Remove majority partner: For each Tomek Link, remove the majority class sample
- This eliminates noisy borderline majority samples that survived the CNN phase
Pseudo-code:
# Abstract OSS algorithm (NOT real implementation)
# Phase 1: CNN -- remove redundant majority samples
S = all_minority_samples + random_majority_seeds(n_seeds_S)
for x in remaining_majority_samples:
nn = nearest_neighbor(x, S)
if label(nn) != label(x):
S.add(x)
# Phase 2: Tomek Links -- remove noisy borderline samples
for (a, b) in all_pairs_in_S:
if is_tomek_link(a, b):
if label(a) == majority_class:
S.remove(a)
elif label(b) == majority_class:
S.remove(b)
# S now contains minority samples + clean, non-redundant majority samples
Key properties:
- Complementary phases: CNN targets redundant samples (far from the boundary) while Tomek Links targets noisy samples (too close to the boundary)
- Minority preservation: Only majority class samples are removed; the minority class remains intact
- Order of operations matters: CNN runs first to reduce the dataset size, making Tomek Links computation faster on the condensed set
- More aggressive than CNN alone: The Tomek Links phase provides additional cleaning that CNN does not perform, since CNN retains noisy samples by design