Principle:Scikit learn contrib Imbalanced learn NearMiss Under Sampling
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Under_Sampling, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
A distance-based under-sampling technique that selects majority class samples based on their proximity to minority class samples.
Description
NearMiss is a family of under-sampling algorithms introduced by Mani and Zhang (2003) that reduce the majority class by selecting samples based on their distance to minority class instances. Unlike random under-sampling, which discards majority samples without regard for their informational value, NearMiss methods use nearest-neighbor distances to make informed decisions about which majority samples to retain.
The key insight is that majority samples near the decision boundary between classes carry the most information for learning a classifier. The three NearMiss variants each formalize a different notion of "near the decision boundary":
- NearMiss-1: For each majority sample, compute the average distance to its k closest minority neighbors. Select the majority samples with the smallest such average distance. This keeps majority samples that are nearest to the minority class, focusing on the boundary region.
- NearMiss-2: For each majority sample, compute the average distance to its k farthest minority neighbors. Select the majority samples with the smallest such average distance. This keeps majority samples that are close to all minority samples, not just the nearest ones. The effect is to retain majority samples that are centrally located relative to the minority class distribution.
- NearMiss-3: A two-phase algorithm. In the first phase, for each minority sample, identify its m nearest majority neighbors, forming a candidate subset of majority samples that are close to at least one minority instance. In the second phase, from this candidate subset, select the majority samples whose average distance to their k nearest minority neighbors is largest. This retains majority samples that, while still near the boundary, provide the widest margin around minority instances.
All three variants support multi-class resampling, where each majority class is under-sampled independently against the minority class.
Usage
Use this principle when working with classification tasks where the majority class significantly outnumbers the minority class and you want to reduce the dataset size rather than generate synthetic samples. NearMiss is appropriate when:
- You want an informed under-sampling approach that preserves the most relevant majority samples near the class boundary
- The dataset is large enough that discarding majority samples does not result in an impractically small training set
- The feature space is continuous and numeric, since the algorithm relies on distance metrics
- You want to avoid the risk that random under-sampling discards informative boundary samples by chance
- You need deterministic results (NearMiss does not involve random selection, unlike random under-sampling)
NearMiss-1 is the default variant in imbalanced-learn (version=1) and a reasonable starting point. NearMiss-2 scores each majority sample against its farthest minority neighbors, so it retains samples that are close to the minority distribution as a whole rather than to a few nearby points. NearMiss-3 is useful when you want to ensure adequate margin around minority instances.
Theoretical Basis
The NearMiss algorithms are grounded in the k-nearest neighbor (kNN) framework. Given a training set with majority class samples X_maj and minority class samples X_min, a kNN model is first fitted on the minority class.
NearMiss-1
For each majority sample x_i in X_maj, compute the distance to its k nearest neighbors in X_min:
score(x_i) = (1/k) * sum( dist(x_i, nn_j) for j in 1..k )
where nn_1, ..., nn_k are the k nearest minority neighbors. Select the majority samples with the smallest scores.
Pseudo-code:
# Abstract NearMiss-1 algorithm (NOT real implementation)
nn_model.fit(X_minority)
for each majority_sample x_i:
    distances = nn_model.kneighbors(x_i, k=n_neighbors)
    score[x_i] = mean(distances)
selected = majority_samples_with_smallest_scores(n_samples)
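The NearMiss-1 scoring rule can be tried out on a toy example. The following is a NumPy-only sketch under stated assumptions: Euclidean distance, a brute-force distance matrix, and a hypothetical helper name `nearmiss1_indices`. It is not the library implementation.

```python
import numpy as np


def nearmiss1_indices(X_majority, X_minority, n_neighbors, n_samples):
    """Indices of the n_samples majority rows with the smallest mean
    distance to their n_neighbors nearest minority rows."""
    # Brute-force Euclidean distances, shape (n_majority, n_minority).
    diffs = X_majority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # Sort each row ascending and keep the k smallest distances.
    nearest = np.sort(dist, axis=1)[:, :n_neighbors]
    score = nearest.mean(axis=1)
    # Smallest scores first; a stable sort resolves ties by original order.
    return np.argsort(score, kind="stable")[:n_samples]


X_min = np.array([[0.0, 0.0], [1.0, 0.0]])
X_maj = np.array([[0.0, 1.0], [5.0, 5.0], [2.0, 1.0], [10.0, 0.0]])
selected = nearmiss1_indices(X_maj, X_min, n_neighbors=2, n_samples=2)
print(selected)
```

Here the two majority points closest to the minority cluster (rows 0 and 2) are retained, while the distant points at (5, 5) and (10, 0) are dropped.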
NearMiss-2
For each majority sample x_i in X_maj, compute the distance to its k farthest neighbors in X_min (i.e., query all minority samples and take the k largest distances):
score(x_i) = (1/k) * sum( dist(x_i, nn_j) for j in (N_min - k + 1)..N_min )
where nn_1, ..., nn_{N_min} are all N_min minority samples ordered by ascending distance from x_i, so the sum runs over the k farthest ones. Select the majority samples with the smallest scores. This selects majority samples that are close even to the farthest minority points, meaning they are centrally placed relative to the entire minority distribution.
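A small sketch can make the difference from NearMiss-1 concrete. It assumes Euclidean distance and a brute-force distance matrix; the helper name `nearmiss2_indices` and the toy points are hypothetical, chosen for illustration only.

```python
import numpy as np


def nearmiss2_indices(X_majority, X_minority, n_neighbors, n_samples):
    """Indices of the n_samples majority rows with the smallest mean
    distance to their n_neighbors farthest minority rows."""
    # Brute-force Euclidean distances, shape (n_majority, n_minority).
    diffs = X_majority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))
    # After an ascending sort, the last k columns hold the k farthest minority points.
    farthest = np.sort(dist, axis=1)[:, -n_neighbors:]
    score = farthest.mean(axis=1)
    return np.argsort(score, kind="stable")[:n_samples]


# Minority points spread along a line; majority row 0 hugs one end,
# row 1 sits above the middle, row 2 is far from everything.
X_min = np.array([[0.0, 0.0], [4.0, 0.0], [8.0, 0.0]])
X_maj = np.array([[0.0, 1.0], [4.0, 3.0], [20.0, 0.0]])
selected = nearmiss2_indices(X_maj, X_min, n_neighbors=1, n_samples=1)
print(selected)
```

With k = 1, the centrally placed row 1 is kept because its farthest minority point is only 5 units away. A nearest-neighbor criterion as in NearMiss-1 would instead keep row 0, which is 1 unit from its closest minority point but more than 8 units from the farthest.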
NearMiss-3
Phase 1 (Candidate Selection): For each minority sample x_min, find its m nearest majority neighbors (m corresponds to n_neighbors_ver3 in the pseudo-code, and k to n_neighbors). The union of all such neighbors forms the candidate set C.
Phase 2 (Final Selection): Fit the kNN model on the minority class and compute distances from each candidate in C to its k nearest minority neighbors. Select the candidates with the largest average distance, thereby retaining majority samples that provide the widest margin.
# Abstract NearMiss-3 algorithm (NOT real implementation)
# Phase 1: Build candidate set
nn_ver3.fit(X_majority)
candidates = set()
for each minority_sample x_min:
    neighbors = nn_ver3.kneighbors(x_min, k=n_neighbors_ver3)
    candidates.update(neighbors)
# Phase 2: Select from candidates by farthest distance
nn_model.fit(X_minority)
for each candidate x_c in candidates:
    distances = nn_model.kneighbors(x_c, k=n_neighbors)
    score[x_c] = mean(distances)
selected = candidates_with_largest_scores(n_samples)
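The two phases can be sketched in runnable form as follows. This is a NumPy-only illustration that assumes Euclidean distance and a brute-force distance matrix. The helper name `nearmiss3_indices` is hypothetical, and the sketch omits edge-case handling (for example, a candidate set smaller than the requested number of samples) that a library implementation would need.

```python
import numpy as np


def nearmiss3_indices(X_majority, X_minority, n_neighbors, n_neighbors_ver3, n_samples):
    """Two-phase selection: shortlist majority rows near any minority row,
    then keep the shortlisted rows with the largest mean distance to their
    n_neighbors nearest minority rows."""
    # Brute-force Euclidean distances, shape (n_majority, n_minority).
    diffs = X_majority[:, None, :] - X_minority[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=2))

    # Phase 1: for each minority column, take the m nearest majority rows.
    nearest_major = np.argsort(dist, axis=0, kind="stable")[:n_neighbors_ver3, :]
    candidates = np.unique(nearest_major)

    # Phase 2: score each candidate by its k nearest minority distances.
    k_nearest = np.sort(dist[candidates], axis=1)[:, :n_neighbors]
    score = k_nearest.mean(axis=1)
    # Largest scores first.
    order = np.argsort(-score, kind="stable")[:n_samples]
    return candidates[order]


X_min = np.array([[0.0, 0.0], [4.0, 0.0]])
X_maj = np.array([[0.0, 1.0], [4.0, 2.0], [2.0, 5.0], [20.0, 20.0]])
selected = nearmiss3_indices(
    X_maj, X_min, n_neighbors=1, n_neighbors_ver3=2, n_samples=1
)
print(selected)
```

In this toy case, Phase 1 shortlists rows 0 and 1 and excludes the outlier at (20, 20) and the point at (2, 5). Phase 2 then prefers row 1, the shortlisted point with the larger distance to its nearest minority sample.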
The number of majority samples retained is determined by the sampling_strategy parameter, which defines the desired class distribution after resampling.
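As a rough illustration of how the retained count follows from sampling_strategy in a binary problem, the sketch below assumes two readings documented by imbalanced-learn: "auto" shrinks the majority class to the minority size, and a float is the desired minority-to-majority ratio after resampling. The helper `majority_target` is hypothetical and exists only to show the arithmetic.

```python
def majority_target(n_majority, n_minority, sampling_strategy="auto"):
    """Hypothetical helper: majority samples to keep in a binary problem."""
    if sampling_strategy == "auto":
        # Balance the classes: keep as many majority samples as minority samples.
        target = n_minority
    else:
        # Float ratio = n_minority / n_majority_after_resampling.
        target = int(n_minority / sampling_strategy)
    if target > n_majority:
        # Under-sampling can only remove samples, never add them.
        raise ValueError("requested more majority samples than are available")
    return target


keep_balanced = majority_target(900, 100)           # 1:1 after resampling
keep_two_to_one = majority_target(900, 100, 0.5)    # 2 majority per minority
keep_four_to_one = majority_target(900, 100, 0.25)  # 4 majority per minority
print(keep_balanced, keep_two_to_one, keep_four_to_one)
```

Whatever count results, the NearMiss variant in use only decides which majority samples fill those slots.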