Implementation:Scikit learn contrib Imbalanced learn InstanceHardnessThreshold

Knowledge Sources	imbalanced-learn, imbalanced-learn Docs, Smith et al. 2014
Domains	Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

A concrete tool, provided by the imbalanced-learn library, for under-sampling by instance hardness thresholding.

Description

The InstanceHardnessThreshold class implements an under-sampling strategy that removes the samples a classifier finds hardest to classify. It extends BaseUnderSampler and uses cross-validated predictions from a classifier to score each sample by the predicted probability of its true class; instance hardness is high when this probability is low. For each targeted class (the majority classes under the default strategy), samples whose probability falls below a percentile threshold derived from the requested sample count are removed, so that only samples the classifier can classify reliably are retained.
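
The mechanism described above can be sketched with scikit-learn primitives alone. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name hardness_threshold_indices and its parameters are hypothetical, the folds are assumed to be stratified and shuffled, and each targeted class is assumed to be trimmed by thresholding the out-of-fold probability of the true class at the percentile that matches the requested count.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def hardness_threshold_indices(X, y, n_keep, estimator, cv=5, random_state=0):
    """Hypothetical helper: indices kept after per-class hardness thresholding.

    n_keep maps a class label to the approximate number of samples to retain;
    classes missing from n_keep are kept in full.
    """
    y = np.asarray(y)
    classes, y_encoded = np.unique(y, return_inverse=True)
    folds = StratifiedKFold(n_splits=cv, shuffle=True, random_state=random_state)
    # Out-of-fold class probabilities, one row per sample
    proba = cross_val_predict(estimator, X, y, cv=folds, method="predict_proba")
    # Probability assigned to each sample's true class (low value = hard instance)
    p_true = proba[np.arange(len(y)), y_encoded]

    kept = []
    for label in classes:
        members = np.flatnonzero(y == label)
        if label in n_keep:
            # Percentile of this class's probabilities below which samples are dropped
            drop_pct = (1.0 - n_keep[label] / len(members)) * 100.0
            threshold = np.percentile(p_true[members], drop_pct)
            members = members[p_true[members] >= threshold]
        kept.append(members)
    return np.sort(np.concatenate(kept))


# Illustrative data: class 1 is the majority and is trimmed to about 100 samples
X, y = make_classification(
    n_samples=500, weights=[0.2, 0.8], flip_y=0.05, random_state=0
)
keep = hardness_threshold_indices(X, y, {1: 100}, LogisticRegression(max_iter=1000))
X_res, y_res = X[keep], y[keep]
print(np.bincount(y), np.bincount(y_res))
```

Because selection is by threshold rather than by rank, tied probabilities can leave more samples than requested, which is why the retained count is approximate.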

Usage

Import this class when you want to under-sample majority classes by removing noisy or ambiguous samples that lie near decision boundaries or in overlapping class regions, rather than randomly discarding majority instances.

Code Reference

Source Location

Repository: imbalanced-learn
File: imblearn/under_sampling/_prototype_selection/_instance_hardness_threshold.py
Lines: L1-209

Signature

class InstanceHardnessThreshold(BaseUnderSampler):
    def __init__(
        self,
        *,
        estimator=None,
        sampling_strategy="auto",
        random_state=None,
        cv=5,
        n_jobs=None,
    ):
        """
        Args:
            estimator: estimator object or None - Classifier used to estimate
                instance hardness. Must implement predict_proba. Defaults to
                RandomForestClassifier(n_estimators=100) when None.
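            sampling_strategy: float, str, dict, or callable - Sampling
                information used to resample the dataset. A float is the
                desired ratio of minority to majority samples (binary
                problems only). 'auto' is equivalent to 'not minority':
                every class except the minority class is under-sampled.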
            random_state: int, RandomState, or None - Seed for reproducibility.
            cv: int - Number of cross-validation folds used to estimate
                instance hardness (default: 5).
            n_jobs: int or None - Number of parallel jobs for cross-validation
                and the default estimator.
        """

Import

from imblearn.under_sampling import InstanceHardnessThreshold

I/O Contract

Inputs

Name	Type	Required	Description
X	{array-like, sparse matrix, dataframe} of shape (n_samples, n_features)	Yes	Feature matrix of training data
y	array-like of shape (n_samples,)	Yes	Target labels
estimator	estimator object or None	No	Classifier with predict_proba (default: RandomForestClassifier)
sampling_strategy	float, str, dict, or callable	No	Sampling information; 'auto' under-samples every class except the minority (default: 'auto')
random_state	int, RandomState, or None	No	Random seed
cv	int	No	Number of cross-validation folds (default: 5)
n_jobs	int or None	No	Number of parallel jobs (default: None)

Outputs

Name	Type	Description
X_resampled	{ndarray, sparse matrix, dataframe} of shape (n_samples_new, n_features)	Feature matrix with hard-to-classify majority samples removed
y_resampled	ndarray of shape (n_samples_new,)	Target array after under-sampling

Key Attributes After Fitting

Attribute	Type	Description
sampling_strategy_	dict	Maps class labels to number of samples to retain
estimator_	estimator object	The validated classifier used for hardness estimation
sample_indices_	ndarray of shape (n_new_samples,)	Indices of samples selected from the original dataset
n_features_in_	int	Number of features in the input dataset
feature_names_in_	ndarray of shape (n_features_in_,)	Names of features seen during fit (when X has string feature names)

Usage Examples

Basic Usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import InstanceHardnessThreshold

# Create imbalanced dataset
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.1, 0.9],
    n_informative=3, n_redundant=1, flip_y=0,
    n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)
print(f"Original: {Counter(y)}")
# Original: Counter({1: 900, 0: 100})

# Apply InstanceHardnessThreshold
iht = InstanceHardnessThreshold(random_state=42)
X_res, y_res = iht.fit_resample(X, y)
print(f"Resampled: {Counter(y_res)}")
# Resampled: Counter({1: 5xx, 0: 100})

Custom Estimator

from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold

# Use logistic regression instead of default random forest
iht = InstanceHardnessThreshold(
    estimator=LogisticRegression(max_iter=1000),
    cv=10,
    random_state=42,
)
# X and y are the arrays created in the Basic Usage example
X_res, y_res = iht.fit_resample(X, y)

In a Pipeline

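from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.pipeline import make_pipeline
from imblearn.under_sampling import InstanceHardnessThreshold

# Split the X, y arrays from the Basic Usage example; stratify to preserve class ratios
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# The sampler is applied only during fit; predict sends data straight to the classifier
pipeline = make_pipeline(
    InstanceHardnessThreshold(random_state=42),
    DecisionTreeClassifier(),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)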
