Principle:Scikit learn contrib Imbalanced learn Neighbourhood Cleaning Rule

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

A two-phase neighbourhood-based data cleaning technique that combines Edited Nearest Neighbours with K-NN-guided majority sample removal to improve minority class recognition.

Description

The Neighbourhood Cleaning Rule (NCR), proposed by Laurikkala (2001), is a data cleaning under-sampling method that focuses on improving the identification of difficult small classes. Rather than aiming for strict class balance, NCR prioritises removing majority-class samples that are either noisy (misclassified by their neighbourhood) or harmful to minority class recognition (causing misclassification of nearby minority samples).

NCR combines two complementary strategies into a unified cleaning approach. The first phase uses Edited Nearest Neighbours (ENN) to remove noisy majority samples. The second phase identifies majority samples in the neighbourhood of misclassified minority samples and removes them. The union of samples identified by both phases is removed from the dataset.

Usage

Use this principle when:

  • The dataset contains noisy majority samples that overlap with minority class regions
  • Standard ENN alone does not sufficiently clean the neighbourhood around minority samples
  • You want a cleaning approach that explicitly protects minority class recognition
  • A neighbourhood-based method is preferred over classifier-based approaches (such as the instance hardness threshold)

Theoretical Basis

NCR operates in two phases, producing two sets of samples to remove: A1 and A2.

Phase 1 (A1): Edited Nearest Neighbours

Apply ENN to the dataset using the mode-based selection rule:

  1. For each sample x_i, find its k nearest neighbours
  2. If the most frequent class among x_i's neighbours (the mode) differs from x_i's true label, and x_i belongs to a majority class, mark it for removal
  3. The set of all majority samples flagged by ENN forms A1

This phase identifies majority-class samples that are inconsistent with their local neighbourhood -- noise and borderline samples.
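The Phase 1 rule can be sketched in pure Python as follows. This is a minimal illustration rather than the library's implementation: the helper names `k_nearest` and `enn_phase` are hypothetical, distances are Euclidean, a sample is excluded from its own neighbourhood, and only the single smallest class is treated as the minority.

```python
from collections import Counter
from math import dist


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]


def enn_phase(points, labels, k=3):
    """Return A1: non-minority samples whose label differs from the mode
    of their k nearest neighbours' labels."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    flagged = set()
    for i, label in enumerate(labels):
        if label == minority:
            continue  # Phase 1 only flags majority-class samples
        votes = Counter(labels[j] for j in k_nearest(points, i, k))
        mode_label = votes.most_common(1)[0][0]
        if mode_label != label:
            flagged.add(i)
    return flagged


# Toy 1-D data: the majority sample at index 5 sits inside the minority cluster.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (5.1,), (5.0,), (5.2,), (5.3,)]
labels = ["maj", "maj", "maj", "maj", "maj", "maj", "min", "min", "min"]
a1 = enn_phase(points, labels, k=3)
print(a1)  # {5}
```

With an odd k and two classes the vote cannot tie; a real implementation needs an explicit tie-breaking rule for other settings.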

Phase 2 (A2): Minority-Guided Cleaning

Use K-NN to identify additional majority samples harming minority classification:

  1. Train a K-NN classifier on the full dataset
  2. For each minority sample x_j, predict its class using K-NN
  3. If x_j is misclassified (predicted class differs from true minority label), identify the k nearest neighbours of x_j
  4. Among those neighbours, mark majority-class samples for removal, but only if their class is large enough: C_i > C × T, where C_i is the number of samples in the neighbour's class, C is the number of samples in the minority class, and T is the cleaning threshold
  5. The set of majority samples flagged in this phase forms A2
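The Phase 2 steps above can be sketched as follows. This is a pure-Python illustration under stated assumptions: `minority_guided_phase` and `k_nearest` are hypothetical helpers, the K-NN prediction is a simple majority vote that leaves the query sample out of its own neighbourhood, and the set of classes that pass the threshold test is supplied by the caller as `eligible_classes`.

```python
from collections import Counter
from math import dist


def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: dist(points[i], points[j]))[:k]


def minority_guided_phase(points, labels, minority, eligible_classes, k=3):
    """Return A2: eligible-class neighbours of misclassified minority samples."""
    flagged = set()
    for j, label in enumerate(labels):
        if label != minority:
            continue
        neighbours = k_nearest(points, j, k)
        predicted = Counter(labels[n] for n in neighbours).most_common(1)[0][0]
        if predicted != minority:
            # x_j is misclassified: flag its neighbours from eligible classes
            flagged.update(n for n in neighbours if labels[n] in eligible_classes)
    return flagged


# Toy 1-D data: the minority sample at index 5 is surrounded by majority samples.
points = [(0.0,), (0.2,), (0.4,), (0.6,), (0.8,), (0.25,), (5.0,), (5.1,), (5.2,)]
labels = ["maj"] * 5 + ["min"] * 4
a2 = minority_guided_phase(points, labels, "min", {"maj"}, k=3)
print(a2)  # {0, 1, 2}

# If no class passes the threshold test, nothing is flagged.
a2_none = minority_guided_phase(points, labels, "min", set(), k=3)
```

Only the three majority samples closest to the misclassified minority sample are flagged; the majority samples at 0.6 and 0.8 are untouched because they are not among its k nearest neighbours.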

Final Removal

Compute the union A1 ∪ A2 and remove all identified samples from the dataset.
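A small sketch of the removal step, assuming A1 and A2 are available as sets of sample indices. The function name `remove_flagged` and the index sets below are hypothetical and chosen only for illustration.

```python
def remove_flagged(points, labels, a1, a2):
    """Drop every sample whose index appears in A1 or A2."""
    to_remove = a1 | a2  # set union: an index flagged by both phases is removed once
    keep = [i for i in range(len(labels)) if i not in to_remove]
    return [points[i] for i in keep], [labels[i] for i in keep]


points = [(0.0,), (0.2,), (0.4,), (5.1,), (5.0,), (5.2,)]
labels = ["maj", "maj", "maj", "maj", "min", "min"]
a1 = {3}      # hypothetical Phase 1 result
a2 = {2, 3}   # hypothetical Phase 2 result
clean_points, clean_labels = remove_flagged(points, labels, a1, a2)
print(clean_labels)  # ['maj', 'maj', 'min', 'min']
```

Because both sets contain only majority-class indices, the minority samples always survive this step.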

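Sketch in Python:

# Sketch of the NCR algorithm using scikit-learn and imbalanced-learn
# (illustrative only, NOT the library's real implementation)
from collections import Counter

import numpy as np
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.neighbors import KNeighborsClassifier


def ncr_sketch(X, y, k=3, threshold=0.5):
    X, y = np.asarray(X), np.asarray(y)
    counts = Counter(y.tolist())
    minority = min(counts, key=counts.get)

    # Phase 1: apply ENN to find noisy majority samples
    enn = EditedNearestNeighbours(kind_sel="mode", n_neighbors=k)
    enn.fit_resample(X, y)
    A1 = np.setdiff1d(np.arange(len(y)), enn.sample_indices_)

    # Phase 2: minority-guided cleaning, restricted to large enough classes
    classes_above_threshold = [
        c for c, n in counts.items()
        if c != minority and n > counts[minority] * threshold
    ]
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)

    A2 = set()
    for j in np.flatnonzero(y == minority):
        x_j = X[j].reshape(1, -1)
        if knn.predict(x_j)[0] != y[j]:
            # x_j is misclassified: find its majority neighbours
            # (x_j itself is among its own neighbours but is never flagged)
            neighbours = knn.kneighbors(x_j, return_distance=False)[0]
            for neighbour in neighbours:
                if y[neighbour] in classes_above_threshold:
                    A2.add(int(neighbour))

    # Remove the union of both sets
    samples_to_remove = set(A1.tolist()) | A2
    mask = np.ones(len(y), dtype=bool)
    mask[list(samples_to_remove)] = False
    return X[mask], y[mask]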

Key Properties

  • Two-phase cleaning: Combining ENN (global noise removal) with minority-guided neighbour removal (targeted protection) provides more thorough cleaning than either method alone.
  • Threshold-controlled: The threshold_cleaning parameter controls which classes are eligible for Phase 2 removal. Only classes whose count exceeds minority_count × threshold are cleaned, preventing removal from already small classes.
  • Non-balanced output: Unlike methods that target a specific class ratio, NCR removes only genuinely problematic samples. The resulting class distribution depends on the amount of noise and overlap in the data.
  • Multi-class support: NCR uses a one-vs-rest scheme when handling multi-class problems, as proposed in the original paper.
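To make the threshold-controlled property concrete, the sketch below computes which classes would be eligible for Phase 2 cleaning under the rule stated above (class count greater than minority_count × threshold). The function name `classes_eligible_for_phase2` and the class counts are hypothetical, and the smallest class is assumed to be the only minority class.

```python
from collections import Counter


def classes_eligible_for_phase2(labels, threshold=0.5):
    """Classes whose size exceeds minority_count * threshold (minority excluded)."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    cutoff = counts[minority] * threshold
    return sorted(c for c, n in counts.items() if c != minority and n > cutoff)


# Hypothetical three-class dataset: 100 "a", 30 "b", 20 "c" (the minority).
labels = ["a"] * 100 + ["b"] * 30 + ["c"] * 20

low = classes_eligible_for_phase2(labels, threshold=0.5)   # cutoff = 10
high = classes_eligible_for_phase2(labels, threshold=2.0)  # cutoff = 40
print(low)   # ['a', 'b']
print(high)  # ['a']
```

Raising the threshold shields the mid-sized class "b" from Phase 2 removal, while the largest class "a" remains eligible.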
