Principle:Scikit learn contrib Imbalanced learn Instance Hardness Thresholding
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Preprocessing, Imbalanced_Learning |
| Last Updated | 2026-02-09 03:00 GMT |
Overview
An under-sampling technique that removes majority-class samples which are hardest to classify, as estimated by a cross-validated classifier.
Description
Instance Hardness Thresholding is a data cleaning method based on the concept of instance hardness, introduced by Smith et al. (2014). Instance hardness measures how likely a given sample is to be misclassified by a learning algorithm. Samples with high instance hardness typically lie near decision boundaries, fall in overlapping class regions, or represent noise. By removing these hard-to-classify majority samples, the method moves the dataset toward class balance while also cleaning it of ambiguous or noisy instances.
Unlike random under-sampling, which discards majority samples indiscriminately, instance hardness thresholding uses a principled, classifier-informed criterion to decide which samples to remove. This preserves the most informative majority-class samples and removes those that would confuse downstream classifiers.
Usage
Use this principle when:
- Random under-sampling discards too many informative majority samples
- The dataset contains noisy or ambiguous majority samples near the decision boundary
- A classifier-guided cleaning approach is preferred over geometric or distance-based methods
- You want the under-sampling step to improve classifier performance by removing confusing samples
Theoretical Basis
The algorithm proceeds in three steps:
- Cross-validated probability estimation: Train a classifier (e.g., Random Forest) using stratified k-fold cross-validation to obtain predicted class probabilities for every sample. For each sample $x_i$ with true label $y_i$, extract the predicted probability of the correct class: $p(y_i \mid x_i)$.
- Instance hardness computation: The instance hardness of $x_i$ is defined as $IH(x_i) = 1 - p(y_i \mid x_i)$. Samples with low $p(y_i \mid x_i)$ are hard to classify (high hardness) and are candidates for removal.
- Threshold-based removal: For each class targeted for under-sampling, compute a percentile threshold based on the desired number of samples to retain. Remove samples whose probability of correct classification falls below this threshold, keeping only the n most confidently classified samples.
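To make the second and third steps concrete, the sketch below applies them to ten made-up probabilities for a single class. The numbers and variable names are hypothetical. NumPy's default linear-interpolation percentile is assumed.

```python
import numpy as np

# Hypothetical cross-validated probabilities of the true class for ten
# majority-class samples (made-up numbers for illustration).
p_correct = np.array([0.95, 0.40, 0.85, 0.10, 0.70, 0.99, 0.55, 0.30, 0.90, 0.65])

# Instance hardness is the complement of the correct-class probability.
hardness = 1.0 - p_correct

# Retain the n_desired easiest samples out of n_current.
n_current = p_correct.size
n_desired = 4
percentile_rank = (1.0 - n_desired / n_current) * 100.0  # 60th percentile
threshold = np.percentile(p_correct, percentile_rank)     # about 0.76 here

# Samples at or above the threshold are the most confidently classified.
kept_indices = np.flatnonzero(p_correct >= threshold)
print(kept_indices)  # [0 2 5 8]
```

The 60th percentile falls between the sorted values 0.70 and 0.85, so only the four samples with probabilities 0.85, 0.90, 0.95, and 0.99 survive.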
Pseudo-code:
# Abstract Instance Hardness Thresholding algorithm (NOT the real implementation)
# Assumes X, y, estimator, k, sampling_strategy, and classes_to_undersample are
# already defined, with y encoded as integers 0..K-1
import numpy as np
from sklearn.model_selection import cross_val_predict

# Step 1: Cross-validated probability estimation
probabilities = cross_val_predict(estimator, X, y, cv=k, method="predict_proba")
# Extract the probability of the correct class for each sample
p_correct = probabilities[np.arange(len(y)), y]

# Steps 2 and 3: For each class to under-sample
for target_class in classes_to_undersample:
    n_desired = sampling_strategy[target_class]
    in_class = y == target_class
    n_current = np.count_nonzero(in_class)
    # Compute the threshold as a percentile of the class's probabilities
    threshold = np.percentile(p_correct[in_class], (1.0 - n_desired / n_current) * 100)
    # Keep samples at or above the threshold (the easiest to classify)
    keep = p_correct[in_class] >= threshold
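A runnable version of the pseudo-code above might look like the following. This is a minimal sketch, not the library's implementation: the helper name `instance_hardness_undersample`, its parameters, the synthetic dataset, and the forest settings are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.preprocessing import LabelEncoder


def instance_hardness_undersample(X, y, sampling_strategy, estimator, n_splits=5, seed=0):
    """Return sorted indices of the samples to keep (hypothetical helper).

    sampling_strategy maps each class to under-sample to the number of
    samples to retain; classes absent from the mapping are kept whole.
    """
    y = np.asarray(y)
    # Encode labels as 0..K-1 so they index the columns of predict_proba.
    y_encoded = LabelEncoder().fit_transform(y)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    proba = cross_val_predict(estimator, X, y_encoded, cv=cv, method="predict_proba")
    p_correct = proba[np.arange(len(y)), y_encoded]

    kept = []
    for target_class in np.unique(y):
        class_idx = np.flatnonzero(y == target_class)
        if target_class in sampling_strategy:
            n_desired = sampling_strategy[target_class]
            rank = (1.0 - n_desired / class_idx.size) * 100.0
            threshold = np.percentile(p_correct[class_idx], rank)
            class_idx = class_idx[p_correct[class_idx] >= threshold]
        kept.append(class_idx)
    return np.sort(np.concatenate(kept))


# Illustrative data: class 0 is the majority, class 1 the minority.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=4,
    weights=[0.85, 0.15],
    random_state=0,
)
n_minority = int(np.sum(y == 1))
keep = instance_hardness_undersample(
    X, y, {0: n_minority}, RandomForestClassifier(n_estimators=50, random_state=0)
)
X_res, y_res = X[keep], y[keep]
print("majority kept:", int(np.sum(y_res == 0)), "minority kept:", int(np.sum(y_res == 1)))
```

The retained majority count is at least `n_minority`. A forest's probabilities are coarse, and every sample tied at the threshold passes the `>=` comparison.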
Key Properties
- Classifier-dependent: The set of removed samples depends on the choice of estimator. Different classifiers may identify different samples as hard to classify.
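- Cross-validation: Using k-fold cross-validation guards against hardness estimates that are optimistically biased by overfitting, since each sample's probability is predicted by a model that was not trained on it.
- Percentile-based threshold: The threshold is computed automatically from the desired number of samples to retain, making it compatible with any sampling strategy. When the predicted probabilities are distinct, exactly that many samples are kept; samples tied at the threshold are all retained, so the final count can exceed the target.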
- Multi-class support: The algorithm operates independently on each class targeted for under-sampling, retaining the observations with the highest probability of being correctly classified.
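The classifier-dependence property can be checked empirically by comparing which majority samples two different estimators rank as hardest. The sketch below is a simplified illustration: the helper `hardest_majority` is hypothetical and ranks a fixed number of samples by probability instead of applying the percentile threshold. The dataset and both estimators are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier


def hardest_majority(estimator, X, y, majority_label, n_remove, seed=0):
    """Indices of the n_remove majority samples with the lowest p(correct)."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    proba = cross_val_predict(estimator, X, y, cv=cv, method="predict_proba")
    p_correct = proba[np.arange(len(y)), y]  # y is already encoded as 0/1 here
    majority_idx = np.flatnonzero(y == majority_label)
    order = np.argsort(p_correct[majority_idx], kind="stable")
    return set(majority_idx[order[:n_remove]].tolist())


# Illustrative data with some label noise so that hard samples exist.
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=4,
    weights=[0.8, 0.2],
    flip_y=0.05,
    random_state=1,
)
removed_lr = hardest_majority(LogisticRegression(max_iter=1000), X, y, 0, 50)
removed_knn = hardest_majority(KNeighborsClassifier(n_neighbors=15), X, y, 0, 50)

# Jaccard overlap: 1.0 means both estimators flag the same samples.
jaccard = len(removed_lr & removed_knn) / len(removed_lr | removed_knn)
print(f"Jaccard overlap of removed sets: {jaccard:.2f}")
```

An overlap below 1.0 indicates that the two estimators disagree about which samples are hardest, which is why the choice of estimator is worth validating against the downstream model.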