Implementation:Scikit learn contrib Imbalanced learn NearMiss
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Under_Sampling, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
Concrete tool, provided by the imbalanced-learn library, for under-sampling majority classes using nearest-neighbor distance heuristics.
Description
The NearMiss class implements three distance-based under-sampling strategies that shrink the majority classes by keeping only the samples ranked best by their distances to minority class samples. It extends BaseUnderSampler and integrates with scikit-learn's estimator API, supporting pipeline composition, parameter validation, and metadata routing.
Three algorithm variants are available:
- NearMiss-1 selects majority class samples whose average distance to their k nearest minority neighbors is smallest, where k is set by n_neighbors. This retains majority samples that are close to the minority class boundary.
- NearMiss-2 selects majority class samples whose average distance to their k farthest minority neighbors is smallest. This retains majority samples that are close to all minority samples, not just the nearest ones.
- NearMiss-3 operates in two phases. First, for each minority sample, the algorithm identifies its nearest majority neighbors (the count is set by n_neighbors_ver3), and the union of these neighbors forms a candidate subset. Then, from this candidate subset, it selects the majority samples whose average distance to their k nearest minority neighbors is largest. This keeps majority samples that provide the widest margin around minority instances.
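In every version, the algorithm fits a k-nearest neighbors model on the minority class and queries it with the majority class samples. The resulting distances rank the majority samples and determine which ones are retained. NearMiss-3 additionally fits a second model on the majority class and queries it with the minority samples to build its candidate subset.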
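The three selection rules described above can be sketched with a brute-force distance matrix in NumPy. This is only an illustration of the rules as stated on this page, not the library's implementation: `pairwise_euclidean`, `nearmiss_select`, and their parameters are hypothetical names, Euclidean distance is assumed, and edge cases (ties, too few candidates in version 3) are ignored.

```python
import numpy as np


def pairwise_euclidean(A, B):
    """Euclidean distance between every row of A and every row of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def nearmiss_select(X_maj, X_min, n_keep, version=1, k=3, k_ver3=3):
    """Return indices into X_maj of the majority samples to keep (sketch)."""
    dist = pairwise_euclidean(X_maj, X_min)  # shape (n_maj, n_min)
    sorted_dist = np.sort(dist, axis=1)      # ascending per majority sample

    if version == 1:
        # Smallest mean distance to the k nearest minority samples.
        score = sorted_dist[:, :k].mean(axis=1)
        return np.argsort(score, kind="stable")[:n_keep]

    if version == 2:
        # Smallest mean distance to the k farthest minority samples.
        score = sorted_dist[:, -k:].mean(axis=1)
        return np.argsort(score, kind="stable")[:n_keep]

    if version == 3:
        # Phase 1: union of the k_ver3 nearest majority samples
        # of each minority sample.
        nearest_maj = np.argsort(dist.T, axis=1)[:, :k_ver3]
        candidates = np.unique(nearest_maj)
        # Phase 2: among candidates, largest mean distance to the
        # k nearest minority samples.
        score = sorted_dist[candidates, :k].mean(axis=1)
        order = np.argsort(-score, kind="stable")[:n_keep]
        return candidates[order]

    raise ValueError("version must be 1, 2, or 3")


# Tiny one-dimensional example: two minority points, four majority points.
X_min = np.array([[0.0], [10.0]])
X_maj = np.array([[1.0], [5.0], [12.0], [4.0]])

keep_v1 = nearmiss_select(X_maj, X_min, n_keep=2, version=1, k=1)
keep_v2 = nearmiss_select(X_maj, X_min, n_keep=2, version=2, k=1)
keep_v3 = nearmiss_select(X_maj, X_min, n_keep=1, version=3, k=1, k_ver3=1)

print(keep_v1)  # [0 2]: points 1.0 and 12.0 sit nearest to a minority point
print(keep_v2)  # [1 3]: points 5.0 and 4.0 are nearest to their farthest minority point
print(keep_v3)  # [2]: of candidates 1.0 and 12.0, 12.0 lies farther from its nearest minority point
```

On this toy data, versions 1 and 2 keep disjoint sets, which shows how strongly the choice of version can change the retained samples.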
Usage
Import this class when you need to reduce the number of majority class samples to balance a dataset before training a classifier. Use it as a standalone resampler via fit_resample() or as a step in an imblearn.pipeline.Pipeline. NearMiss is especially useful when you want an informed under-sampling strategy that considers the geometric relationship between majority and minority classes, rather than removing majority samples at random.
Code Reference
Source Location
- Repository: Scikit_learn_contrib_Imbalanced_learn
- File: imblearn/under_sampling/_prototype_selection/_nearmiss.py
- Lines: 24-322
Signature
class NearMiss(BaseUnderSampler):
    def __init__(
        self,
        *,
        sampling_strategy="auto",
        version=1,
        n_neighbors=3,
        n_neighbors_ver3=3,
        n_jobs=None,
    ):
        """
        Args:
            sampling_strategy: str, float, dict, or callable - Resampling
                target. A float is the desired minority-to-majority ratio
                after resampling (binary problems only). 'auto' under-samples
                every class except the minority class.
            version: int (1, 2, or 3) - Version of the NearMiss algorithm to use.
            n_neighbors: int or KNeighborsMixin estimator - Number of nearest
                neighbors used to compute average distances (default: 3).
            n_neighbors_ver3: int or KNeighborsMixin estimator - Number of
                nearest neighbors used in the initial candidate selection phase
                of NearMiss-3 (default: 3).
            n_jobs: int or None - Number of parallel jobs for nearest neighbor
                search. None means 1 unless inside a joblib.parallel_backend
                context.
        """
Import
from imblearn.under_sampling import NearMiss
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | {array-like, sparse matrix, dataframe} of shape (n_samples, n_features) | Yes | Feature matrix of training data |
| y | array-like of shape (n_samples,) | Yes | Target labels indicating class membership |
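| sampling_strategy | str, float, dict, or callable | No | Resampling target; a float is the minority-to-majority ratio after resampling (binary only), a dict maps each class to its sample count, and 'auto' under-samples every class except the minority (default: 'auto') |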
| version | int (1, 2, or 3) | No | NearMiss algorithm version to use (default: 1) |
| n_neighbors | int or KNeighborsMixin estimator | No | Number of nearest neighbors for distance computation (default: 3) |
| n_neighbors_ver3 | int or KNeighborsMixin estimator | No | Number of nearest neighbors for the candidate selection phase in NearMiss-3 (default: 3) |
| n_jobs | int or None | No | Number of parallel jobs for nearest neighbor search (default: None) |
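A float sampling_strategy can be read as the ratio of minority samples to majority samples after resampling. The sketch below shows how such a ratio would translate into a per-class target count under that reading. It is an assumption-laden illustration, not the library's code: `targets_from_ratio` is a hypothetical helper, and the truncation with `int` is assumed rather than taken from this page.

```python
from collections import Counter


def targets_from_ratio(y, ratio):
    """Map a minority/majority ratio to a target count for the majority class (sketch)."""
    counts = Counter(y)
    if len(counts) != 2:
        raise ValueError("a float ratio is only meaningful for two classes")
    minority = min(counts, key=counts.get)
    n_minority = counts[minority]
    # ratio = n_minority / n_majority_after  =>  n_majority_after = n_minority / ratio
    targets = {
        label: int(n_minority / ratio) for label in counts if label != minority
    }
    for label, n_target in targets.items():
        if n_target > counts[label]:
            raise ValueError("under-sampling cannot create new samples")
    return targets


y_demo = [0] * 100 + [1] * 900
half = targets_from_ratio(y_demo, 0.5)
equal = targets_from_ratio(y_demo, 1.0)
print(half)   # {1: 200}
print(equal)  # {1: 100}
```

Under this reading, a ratio of 0.5 on a 100/900 dataset asks for the same result as the dict form `{1: 200}`.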
Outputs
| Name | Type | Description |
|---|---|---|
| X_resampled | {ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features) | Feature matrix with majority class samples reduced |
| y_resampled | ndarray of shape (n_samples_new,) | Target array with corresponding labels after under-sampling |
Attributes
| Name | Type | Description |
|---|---|---|
| sampling_strategy_ | dict | Dictionary mapping class labels to the number of samples to select for each class |
| nn_ | estimator object | Validated K-nearest Neighbours estimator created from the n_neighbors parameter |
| nn_ver3_ | estimator object | Validated K-nearest Neighbours estimator created from the n_neighbors_ver3 parameter (only set when version=3) |
| sample_indices_ | ndarray of shape (n_new_samples,) | Indices of the samples selected from the original dataset |
| n_features_in_ | int | Number of features in the input dataset |
| feature_names_in_ | ndarray of shape (n_features_in_,) | Names of features seen during fit (defined only when X has feature names that are all strings) |
Usage Examples
Basic Under-Sampling with NearMiss-1
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss
# 1. Create an imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})
# 2. Apply NearMiss-1 (default)
nm = NearMiss()
X_resampled, y_resampled = nm.fit_resample(X, y)
print(f"Resampled: {Counter(y_resampled)}")
# Resampled: Counter({0: 100, 1: 100})
Selecting a NearMiss Version
from imblearn.under_sampling import NearMiss
# X and y are the arrays created in the first example.
# NearMiss-2: keeps majority samples with the smallest average distance
# to their farthest minority neighbors
nm2 = NearMiss(version=2, n_neighbors=5)
X_res2, y_res2 = nm2.fit_resample(X, y)
# NearMiss-3: two-phase selection with candidate pre-filtering
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
X_res3, y_res3 = nm3.fit_resample(X, y)
Inside a Pipeline
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import NearMiss
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_validate
# Build pipeline with NearMiss + classifier
pipeline = make_pipeline(NearMiss(version=1), LinearSVC())
# Cross-validate (NearMiss applied only to training folds)
scores = cross_validate(pipeline, X, y, scoring="balanced_accuracy", cv=5)
print(f"Mean balanced accuracy: {scores['test_score'].mean():.3f}")
Custom Sampling Strategy
from imblearn.under_sampling import NearMiss
# Specify exact number of samples to retain per majority class
nm = NearMiss(
    sampling_strategy={1: 200},  # Retain 200 majority class samples
    version=1,
    n_neighbors=5,
)
X_res, y_res = nm.fit_resample(X, y)
Accessing Selected Sample Indices
from imblearn.under_sampling import NearMiss
nm = NearMiss(version=1)
X_res, y_res = nm.fit_resample(X, y)
# Retrieve indices of samples retained from the original dataset
selected_indices = nm.sample_indices_
print(f"Number of selected samples: {len(selected_indices)}")