Principle:Scikit learn contrib Imbalanced learn Condensed Nearest Neighbour
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A prototype selection technique that produces a minimal consistent subset of the training set such that every sample in the original set is correctly classified by a 1-nearest-neighbor rule using only the subset.
Description
The Condensed Nearest Neighbour (CNN) rule, introduced by Hart (1968), addresses the storage and computational cost of the k-nearest-neighbor classifier by reducing the training set to a smaller consistent subset. A subset S of the training set T is called consistent if every sample in T is correctly classified by a 1-NN classifier trained on S alone.
In the context of imbalanced learning, CNN is repurposed as an under-sampling strategy: it removes redundant majority class samples that reside far from the decision boundary, retaining only those majority samples that are necessary for correct classification. This naturally preserves samples near class boundaries while discarding interior majority samples.
The technique supports multi-class problems through a one-vs.-rest decomposition, where each majority class is condensed independently against the minority class.
Usage
Use this principle when the majority class contains large homogeneous regions of samples that add no discriminative value. CNN is appropriate when:
- The goal is to reduce the majority class while preserving decision boundary information
- Redundant majority samples far from the boundary inflate training time without improving accuracy
- A non-heuristic, convergence-guaranteed reduction strategy is preferred over random under-sampling
- The dataset is not heavily noisy (CNN retains noisy samples since they are typically misclassified)
Theoretical Basis
The CNN algorithm constructs a consistent subset through an iterative process:
Algorithm:
- Initialize: Start with set S containing all minority class samples plus n_seeds_S randomly selected majority class samples
- Classify: For each remaining majority sample x not in S, classify it using 1-NN on the current set S
- Add misclassified: If x is misclassified (i.e., the nearest neighbor in S belongs to a different class), add x to S
- Repeat: Continue until no more samples are added to S (convergence)
Pseudo-code:
# Abstract CNN algorithm (NOT real implementation)
S = all_minority_samples + random_majority_seeds(n_seeds_S)
changed = True
while changed:
changed = False
for x in majority_samples_not_in_S:
nn = nearest_neighbor(x, S)
if label(nn) != label(x):
S.add(x)
changed = True
# S is now a consistent subset of T
Key properties:
- Convergence: The algorithm is guaranteed to terminate because S can only grow and is bounded by the size of T
- Consistency: Upon termination, every sample in T is correctly classified by 1-NN using S
- Order dependence: The resulting subset depends on the order in which samples are processed and the initial random seed
- Sensitivity to noise: Noisy samples (mislabeled or near the boundary) are typically retained, as they are misclassified by 1-NN and thus added to S