Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Scikit learn contrib Imbalanced learn Condensed Nearest Neighbour

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

A prototype selection technique that produces a minimal consistent subset of the training set such that every sample in the original set is correctly classified by a 1-nearest-neighbor rule using only the subset.

Description

The Condensed Nearest Neighbour (CNN) rule, introduced by Hart (1968), addresses the storage and computational cost of the k-nearest-neighbor classifier by reducing the training set to a smaller consistent subset. A subset S of the training set T is called consistent if every sample in T is correctly classified by a 1-NN classifier trained on S alone.

In the context of imbalanced learning, CNN is repurposed as an under-sampling strategy: it removes redundant majority class samples that reside far from the decision boundary, retaining only those majority samples that are necessary for correct classification. This naturally preserves samples near class boundaries while discarding interior majority samples.

The technique supports multi-class problems through a one-vs.-rest decomposition, where each majority class is condensed independently against the minority class.

Usage

Use this principle when the majority class contains large homogeneous regions of samples that add no discriminative value. CNN is appropriate when:

  • The goal is to reduce the majority class while preserving decision boundary information
  • Redundant majority samples far from the boundary inflate training time without improving accuracy
  • A non-heuristic, convergence-guaranteed reduction strategy is preferred over random under-sampling
  • The dataset is not heavily noisy (CNN retains noisy samples since they are typically misclassified)

Theoretical Basis

The CNN algorithm constructs a consistent subset through an iterative process:

Algorithm:

  1. Initialize: Start with set S containing all minority class samples plus n_seeds_S randomly selected majority class samples
  2. Classify: For each remaining majority sample x not in S, classify it using 1-NN on the current set S
  3. Add misclassified: If x is misclassified (i.e., the nearest neighbor in S belongs to a different class), add x to S
  4. Repeat: Continue until no more samples are added to S (convergence)

Pseudo-code:

# Abstract CNN algorithm (NOT real implementation)
S = all_minority_samples + random_majority_seeds(n_seeds_S)

changed = True
while changed:
    changed = False
    for x in majority_samples_not_in_S:
        nn = nearest_neighbor(x, S)
        if label(nn) != label(x):
            S.add(x)
            changed = True

# S is now a consistent subset of T

Key properties:

  • Convergence: The algorithm is guaranteed to terminate because S can only grow and is bounded by the size of T
  • Consistency: Upon termination, every sample in T is correctly classified by 1-NN using S
  • Order dependence: The resulting subset depends on the order in which samples are processed and the initial random seed
  • Sensitivity to noise: Noisy samples (mislabeled or near the boundary) are typically retained, as they are misclassified by 1-NN and thus added to S

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment