Principle:Scikit learn contrib Imbalanced learn Neighbourhood Cleaning Rule
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A two-phase neighbourhood-based data cleaning technique that combines Edited Nearest Neighbours with K-NN-guided majority sample removal to improve minority class recognition.
Description
The Neighbourhood Cleaning Rule (NCR), proposed by Laurikkala (2001), is a data cleaning under-sampling method that focuses on improving the identification of difficult small classes. Rather than aiming for strict class balance, NCR prioritises removing majority-class samples that are either noisy (misclassified by their neighbourhood) or harmful to minority class recognition (causing misclassification of nearby minority samples).
NCR combines two complementary strategies into a unified cleaning approach. The first phase uses Edited Nearest Neighbours (ENN) to remove noisy majority samples. The second phase identifies majority samples in the neighbourhood of misclassified minority samples and removes them. The union of samples identified by both phases is removed from the dataset.
Usage
Use this principle when:
- The dataset contains noisy majority samples that overlap with minority class regions
- Standard ENN alone does not sufficiently clean the neighbourhood around minority samples
- You want a cleaning approach that explicitly protects minority class recognition
- A neighbourhood-based method is preferred over classifier-based approaches (like instance hardness thresholding)
Theoretical Basis
NCR operates in two phases, producing two sets of samples to remove: A1 and A2.
Phase 1 (A1): Edited Nearest Neighbours
Apply ENN to the dataset using the mode-based selection rule:
- For each sample x_i, find its k nearest neighbours
- If the most common class among x_i's neighbours differs from x_i's true label, and x_i belongs to a majority class, mark it for removal
- The set of all majority samples flagged by ENN forms A1
This phase identifies majority-class samples that are inconsistent with their local neighbourhood -- noise and borderline samples.
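The Phase 1 rule can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the helper names `k_nearest` and `enn_flagged`, the toy data, and the choice to exclude a sample from its own neighbourhood are all assumptions made for illustration.

```python
from collections import Counter

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], p)), j)
        for j, p in enumerate(points)
        if j != i
    )
    return [j for _, j in dists[:k]]

def enn_flagged(points, labels, minority, k=3):
    """Return A1: majority samples whose neighbourhood mode disagrees with their label."""
    flagged = set()
    for i, label in enumerate(labels):
        if label == minority:
            continue  # Phase 1 only flags majority-class samples
        votes = Counter(labels[j] for j in k_nearest(points, i, k))
        if votes.most_common(1)[0][0] != label:
            flagged.add(i)
    return flagged

# Toy 1-D data (hypothetical): the majority point at index 4 sits inside the minority cluster.
points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.1,), (5.0,), (5.2,), (5.3,)]
labels = ["maj", "maj", "maj", "maj", "maj", "min", "min", "min"]

A1 = enn_flagged(points, labels, minority="min", k=3)
print(A1)  # {4}
```

Only index 4 is flagged, because all three of its nearest neighbours are minority samples. An odd k avoids tied votes in the two-class case.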
Phase 2 (A2): Minority-Guided Cleaning
Use K-NN to identify additional majority samples harming minority classification:
- Train a K-NN classifier on the full dataset
- For each minority sample x_j, predict its class using K-NN
- If x_j is misclassified (predicted class differs from true minority label), identify the k nearest neighbours of x_j
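- Among those neighbours, mark majority-class samples for removal, but only if their class size exceeds a threshold: $C_i > C \cdot T$, where $C_i$ is the class count, $C$ is the minority-class count (the reference size used by the threshold rule under Key Properties), and $T$ is the cleaning threshold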
- The set of majority samples flagged in this phase forms A2
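The Phase 2 steps can be sketched in pure Python as follows. This is an illustrative sketch under stated assumptions, not the library's code: the helper names are hypothetical, each sample is excluded from its own neighbourhood, and the eligibility test is taken to be "class count > minority count × threshold", following the Key Properties section.

```python
from collections import Counter

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], p)), j)
        for j, p in enumerate(points)
        if j != i
    )
    return [j for _, j in dists[:k]]

def minority_guided_flagged(points, labels, minority, k=3, threshold=0.5):
    """Return A2: eligible majority neighbours of misclassified minority samples."""
    counts = Counter(labels)
    # Classes large enough to be cleaned (assumed rule: count > minority count * threshold).
    eligible = {
        c for c, n in counts.items()
        if c != minority and n > counts[minority] * threshold
    }
    flagged = set()
    for j, label in enumerate(labels):
        if label != minority:
            continue
        neighbours = k_nearest(points, j, k)
        predicted = Counter(labels[n] for n in neighbours).most_common(1)[0][0]
        if predicted != label:  # the minority sample is misclassified by its neighbourhood
            flagged.update(n for n in neighbours if labels[n] in eligible)
    return flagged

# Toy 1-D data (hypothetical): the minority point at index 5 lies among majority points.
points = [(0.0,), (0.2,), (0.4,), (1.0,), (1.2,), (0.3,), (5.0,), (5.1,), (5.2,)]
labels = ["maj"] * 5 + ["min"] * 4

A2 = minority_guided_flagged(points, labels, minority="min", k=3, threshold=0.5)
print(A2)  # {0, 1, 2}
```

The minority sample at index 5 is voted "maj" by its three nearest neighbours, so those neighbours (indices 0, 1, 2) are flagged. The other minority samples are classified correctly and contribute nothing to A2.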
Final Removal
Compute the union $A1 \cup A2$ and remove all identified samples from the dataset.
Pseudo-code:
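# Abstract NCR algorithm (NOT real implementation)
# Assumes NumPy arrays X and y, an integer k, the label minority_class,
# and the collection classes_above_threshold are already defined.
import numpy as np
from imblearn.under_sampling import EditedNearestNeighbours
from sklearn.neighbors import KNeighborsClassifier

# Phase 1: Apply ENN to find noisy majority samples
enn = EditedNearestNeighbours(kind_sel="mode", n_neighbors=k)
enn.fit_resample(X, y)
# A1 = indices that ENN did not keep
A1 = set(np.setdiff1d(np.arange(len(y)), enn.sample_indices_).tolist())

# Phase 2: Minority-guided cleaning
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X, y)
A2 = set()
for j in np.flatnonzero(y == minority_class):
    x_j = X[j].reshape(1, -1)  # scikit-learn expects a 2-D array
    if knn.predict(x_j)[0] != y[j]:
        # x_j is misclassified -- find its majority neighbours
        neighbours = knn.kneighbors(x_j, return_distance=False)[0]
        for n in neighbours:
            if y[n] in classes_above_threshold:
                A2.add(int(n))

# Remove union of both sets
samples_to_remove = A1 | A2
keep = ~np.isin(np.arange(len(y)), list(samples_to_remove))
X_clean = X[keep]
y_clean = y[keep]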
Key Properties
- Two-phase cleaning: Combining ENN (global noise removal) with minority-guided neighbour removal (targeted protection) provides more thorough cleaning than either method alone.
- Threshold-controlled: The threshold_cleaning parameter controls which classes are eligible for Phase 2 removal. Only classes whose count exceeds $\text{minority\_count} \times \text{threshold}$ are cleaned, preventing removal from already small classes.
- Non-balanced output: Unlike methods that target a specific class ratio, NCR removes only genuinely problematic samples. The resulting class distribution depends on the amount of noise and overlap in the data.
- Multi-class support: NCR uses a one-vs-rest scheme when handling multi-class problems, as proposed in the original paper.
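The threshold rule listed above can be checked by hand on a multi-class example. The sketch below is illustrative only: the function name `classes_to_clean`, the class counts, and the explicit exclusion of the minority class are assumptions, not the library's API.

```python
from collections import Counter

def classes_to_clean(labels, threshold=0.5):
    """Classes whose count exceeds minority_count * threshold (minority itself excluded)."""
    counts = Counter(labels)
    minority = min(counts, key=counts.get)
    limit = counts[minority] * threshold
    return sorted(c for c, n in counts.items() if c != minority and n > limit)

# Hypothetical four-class dataset: counts are a=100, b=40, c=12, d=10 (d is the minority).
labels = ["a"] * 100 + ["b"] * 40 + ["c"] * 12 + ["d"] * 10

print(classes_to_clean(labels, threshold=0.5))  # limit = 5  -> ['a', 'b', 'c']
print(classes_to_clean(labels, threshold=1.5))  # limit = 15 -> ['a', 'b']
```

Raising the threshold from 0.5 to 1.5 moves the limit from 5 to 15 samples, so class c (12 samples) is no longer eligible for Phase 2 removal. This is how the parameter shields classes that are only slightly larger than the minority.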