Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Scikit learn contrib Imbalanced learn One Sided Selection

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

A two-phase under-sampling strategy that combines Condensed Nearest Neighbour for redundant sample removal with Tomek Links for noisy borderline sample removal, targeting both types of uninformative majority class samples.

Description

One-Sided Selection (OSS), introduced by Kubat and Matwin (1997), addresses the class imbalance problem by identifying and removing two distinct types of problematic majority class samples:

  1. Redundant samples: Majority instances located far from the decision boundary that contribute nothing to the classifier's discriminative ability. These are removed by the Condensed Nearest Neighbour (CNN) phase.
  2. Noisy/borderline samples: Majority instances that lie very close to minority samples, forming Tomek Links. These are removed by the Tomek Links cleaning phase.

By combining these two complementary techniques, OSS produces a cleaner and smaller majority class that retains only the informative samples near (but not too near) the decision boundary. This is a one-sided selection because only the majority class is reduced; the minority class remains untouched.

The technique supports multi-class problems through a one-vs.-rest decomposition, where each majority class is processed independently against the minority class.

Usage

Use this principle when the majority class suffers from both redundancy and borderline noise. One-Sided Selection is appropriate when:

  • Plain CNN retains too many noisy borderline majority samples
  • Tomek Links alone does not remove enough redundant interior majority samples
  • A two-phase cleaning strategy is acceptable in terms of computational cost
  • The goal is to improve classifier performance by removing both uninformative and misleading majority samples
  • The dataset is not extremely small (the CNN phase needs enough samples for meaningful 1-NN classification)

Theoretical Basis

One-Sided Selection operates in two sequential phases:

Phase 1 -- Condensed Nearest Neighbour (redundancy removal):

  1. Initialize: Start with set S containing all minority class samples plus n_seeds_S randomly selected majority class samples
  2. Classify: For each remaining majority sample, classify it using 1-NN on S
  3. Add misclassified: If a majority sample is misclassified, add it to S
  4. The result is a consistent subset where redundant interior majority samples have been removed

Phase 2 -- Tomek Links (noise removal):

  1. Identify Tomek Links: A pair (a, b) forms a Tomek Link if a and b belong to different classes and each is the other's nearest neighbor
  2. Remove majority partner: For each Tomek Link, remove the majority class sample
  3. This eliminates noisy borderline majority samples that survived the CNN phase

Pseudo-code:

# Abstract OSS algorithm (NOT real implementation)

# Phase 1: CNN -- remove redundant majority samples
S = all_minority_samples + random_majority_seeds(n_seeds_S)
for x in remaining_majority_samples:
    nn = nearest_neighbor(x, S)
    if label(nn) != label(x):
        S.add(x)

# Phase 2: Tomek Links -- remove noisy borderline samples
for (a, b) in all_pairs_in_S:
    if is_tomek_link(a, b):
        if label(a) == majority_class:
            S.remove(a)
        elif label(b) == majority_class:
            S.remove(b)

# S now contains minority samples + clean, non-redundant majority samples

Key properties:

  • Complementary phases: CNN targets redundant samples (far from the boundary) while Tomek Links targets noisy samples (too close to the boundary)
  • Minority preservation: Only majority class samples are removed; the minority class remains intact
  • Order of operations matters: CNN runs first to reduce the dataset size, making Tomek Links computation faster on the condensed set
  • More aggressive than CNN alone: The Tomek Links phase provides additional cleaning that CNN does not perform, since CNN retains noisy samples by design

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment