Principle:Scikit learn contrib Imbalanced learn Instance Hardness Thresholding

Knowledge Sources	Smith et al. "An instance level analysis of data complexity." Machine Learning 95.2 (2014): 225-256.
Domains	Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated	2026-02-09 03:00 GMT

Overview

An under-sampling technique that removes majority-class samples which are hardest to classify, as estimated by a cross-validated classifier.

Description

Instance Hardness Thresholding is a data cleaning method based on the concept of instance hardness, introduced by Smith et al. (2014). Instance hardness measures how likely a given sample is to be misclassified by a learning algorithm. Samples with high instance hardness typically lie near decision boundaries, fall in overlapping class regions, or represent noise. By removing these hard-to-classify majority samples, the method moves the class distribution toward balance while also cleaning the dataset of ambiguous or noisy instances.

Unlike random under-sampling, which discards majority samples indiscriminately, instance hardness thresholding uses a classifier-informed criterion to decide which samples to remove. The intent is to retain the majority-class samples that the estimator classifies most confidently and to remove those likely to confuse downstream classifiers.

Usage

Use this principle when:

Random under-sampling discards too many informative majority samples
The dataset contains noisy or ambiguous majority samples near the decision boundary
A classifier-guided cleaning approach is preferred over geometric or distance-based methods
You want the under-sampling step to improve classifier performance by removing confusing samples

Theoretical Basis

The algorithm proceeds in three steps:

Cross-validated probability estimation: Train a classifier (e.g., Random Forest) using stratified k-fold cross-validation to obtain out-of-fold predicted class probabilities for every sample. For each sample $x_i$ with true label $y_i$, extract the predicted probability of the correct class, $p(y_i \mid x_i)$.
Instance hardness computation: The instance hardness of $x_i$ is defined as $1 - p(y_i \mid x_i)$. Samples with low $p(y_i \mid x_i)$ are hard to classify (high hardness) and are candidates for removal.
Threshold-based removal: For each class targeted for under-sampling, compute a percentile threshold from the desired number of samples to retain, $n$. Remove samples whose probability of correct classification falls below this threshold, keeping roughly the $n$ most confidently classified samples of that class.
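As a minimal sketch of the hardness computation in step 2, the snippet below derives instance hardness from a matrix of predicted probabilities. The function name `instance_hardness`, the label names, and the probability values are illustrative assumptions: the numbers are invented rather than produced by a real estimator.

```python
def instance_hardness(probabilities, y, classes):
    """Return 1 - p(y_i | x_i) for every sample.

    probabilities: one row per sample, one column per class (hypothetical input).
    y: true label of each sample.
    classes: class labels in the same order as the probability columns.
    """
    # Map each label to its column so labels need not be integers 0..K-1.
    column_of = {label: j for j, label in enumerate(classes)}
    return [1.0 - row[column_of[label]] for row, label in zip(probabilities, y)]


classes = ["majority", "minority"]
# Invented out-of-fold probabilities for four samples.
probabilities = [
    [0.95, 0.05],
    [0.40, 0.60],
    [0.70, 0.30],
    [0.20, 0.80],
]
y = ["majority", "majority", "majority", "minority"]

hardness = instance_hardness(probabilities, y, classes)
# Index of the hardest sample: the one with the lowest correct-class probability.
hardest = max(range(len(y)), key=lambda i: hardness[i])
```

Here the second sample is the hardest, because the (invented) model assigns its true class a probability of only 0.40.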

Pseudo-code:

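# Abstract Instance Hardness Thresholding algorithm (NOT real implementation)
# Assumes estimator, X, y, k, classes_to_undersample and sampling_strategy are
# already defined, that X and y are NumPy arrays, and that y holds integer
# labels 0..n_classes-1 so each label is also its probability column index.
import numpy as np
from sklearn.model_selection import cross_val_predict

# Step 1: Cross-validated probability estimation
# (an integer cv uses stratified folds when the estimator is a classifier)
probabilities = cross_val_predict(estimator, X, y, cv=k, method="predict_proba")
# Extract probability of correct class for each sample
p_correct = probabilities[np.arange(len(y)), y]

# Steps 2 & 3: For each class to under-sample
keep_mask = np.ones(len(y), dtype=bool)
for target_class in classes_to_undersample:
    in_class = y == target_class
    n_desired = sampling_strategy[target_class]
    n_current = np.count_nonzero(in_class)
    # Compute threshold as a percentile of this class's correct-class probabilities
    threshold = np.percentile(
        p_correct[in_class],
        (1.0 - n_desired / n_current) * 100,
    )
    # Keep samples at or above the threshold (easiest to classify)
    keep_mask[in_class] = p_correct[in_class] >= threshold

X_resampled, y_resampled = X[keep_mask], y[keep_mask]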
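
The per-class loop above can be exercised on concrete numbers without any dependencies. The following sketch uses only the standard library. `percentile` and `select_easiest` are hypothetical helpers written for this example (the percentile interpolates linearly between order statistics, which is NumPy's default behaviour), and the probabilities are invented.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between order statistics."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def select_easiest(p_correct, y, sampling_strategy):
    """Return sorted indices of the samples to keep.

    sampling_strategy maps each class to under-sample to its desired count;
    classes that are not listed are kept in full.
    """
    keep = []
    for target_class, n_desired in sampling_strategy.items():
        idx = [i for i, label in enumerate(y) if label == target_class]
        probs = [p_correct[i] for i in idx]
        q = (1.0 - n_desired / len(idx)) * 100.0
        threshold = percentile(probs, q)
        # Keep the samples at or above the threshold (easiest to classify).
        keep.extend(i for i in idx if p_correct[i] >= threshold)
    keep.extend(i for i, label in enumerate(y) if label not in sampling_strategy)
    return sorted(keep)


# Six majority samples (class 0) and two minority samples (class 1).
y = [0, 0, 0, 0, 0, 0, 1, 1]
p_correct = [0.95, 0.30, 0.80, 0.55, 0.90, 0.10, 0.70, 0.60]

# Reduce class 0 from six samples to two.
kept = select_easiest(p_correct, y, {0: 2})
```

With these values the threshold falls between 0.80 and 0.90, so only the two majority samples with probabilities 0.95 and 0.90 survive, alongside both minority samples.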

Key Properties

Classifier-dependent: The set of removed samples depends on the choice of estimator. Different classifiers may identify different samples as hard to classify.
Cross-validation: Using k-fold cross-validation ensures that hardness estimates are not biased by overfitting, since each sample's probability is predicted by a model that did not train on it.
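Percentile-based threshold: The threshold is computed automatically from the desired number of samples, which makes the method compatible with any sampling strategy. The retained count matches the target only approximately: samples tied at the threshold probability are all kept, so more samples than requested may remain.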
Multi-class support: The algorithm operates independently on each class targeted for under-sampling, retaining the observations with the highest probability of being correctly classified.
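
One caveat follows from the `>=` comparison in the pseudo-code: when several samples share the threshold probability, all of them are kept, so the retained count can exceed the requested one. The sketch below illustrates this with invented probabilities. Retaining half of the samples corresponds to the 50th percentile, computed here with `statistics.median`, which coincides with the linearly interpolated percentile in this case.

```python
import statistics

# Invented correct-class probabilities for 10 majority samples. Six of them
# share the value 0.8, as can happen when an estimator outputs coarse
# probabilities (for example, an ensemble with few members).
p_correct = [0.9, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.6, 0.5, 0.4]
n_desired = 5

# Retaining 5 of 10 samples corresponds to the 50th percentile, i.e. the median.
threshold = statistics.median(p_correct)

# The >= comparison keeps every sample tied with the threshold.
kept = [p for p in p_correct if p >= threshold]
n_kept = len(kept)
```

Here `n_kept` is 7 rather than the requested 5, so it is worth checking the class counts after resampling instead of assuming an exact match.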

Related Pages

Implemented By

Implementation:Scikit_learn_contrib_Imbalanced_learn_InstanceHardnessThreshold
