Principle:Scikit learn contrib Imbalanced learn Instance Hardness Thresholding

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Preprocessing, Imbalanced_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

An under-sampling technique that removes majority-class samples which are hardest to classify, as estimated by a cross-validated classifier.

Description

Instance Hardness Thresholding is a data cleaning method based on the concept of instance hardness, introduced by Smith et al. (2014). Instance hardness measures how likely a given sample is to be misclassified by a learning algorithm. Samples with high instance hardness tend to lie near decision boundaries, fall in overlapping class regions, or represent noise. By removing these hard-to-classify majority samples, the method moves the dataset toward class balance while also cleaning it of ambiguous or noisy instances.

Unlike random under-sampling, which discards majority samples indiscriminately, instance hardness thresholding uses a principled, classifier-informed criterion to decide which samples to remove. This preserves the most informative majority-class samples and removes those that would confuse downstream classifiers.

Usage

Use this principle when:

  • Random under-sampling discards too many informative majority samples
  • The dataset contains noisy or ambiguous majority samples near the decision boundary
  • A classifier-guided cleaning approach is preferred over geometric or distance-based methods
  • You want the under-sampling step to improve classifier performance by removing confusing samples

Theoretical Basis

The algorithm proceeds in three steps:

  1. Cross-validated probability estimation: Train a classifier (e.g., Random Forest) using stratified k-fold cross-validation to obtain predicted class probabilities for every sample. For each sample x_i with true label y_i, extract the predicted probability of the correct class: p(y_i | x_i).
  2. Instance hardness computation: The instance hardness of x_i is defined as 1 - p(y_i | x_i). Samples with low p(y_i | x_i) are hard to classify (high hardness) and are candidates for removal.
  3. Threshold-based removal: For each class targeted for under-sampling, compute a percentile threshold based on the desired number of samples to retain. Remove samples whose probability of correct classification falls below this threshold, keeping only the n most confidently classified samples.
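To make steps 2 and 3 concrete, here is a small worked example in plain Python. The probabilities are made-up numbers standing in for the cross-validated output of step 1, and the variable names are chosen for illustration.

```python
# Hypothetical cross-validated probabilities of the correct class for six
# majority-class samples (made-up numbers for illustration).
p_correct = [0.95, 0.40, 0.80, 0.10, 0.65, 0.90]
n_desired = 3  # number of majority samples to retain

# Step 2: instance hardness is the complement of p(y_i | x_i).
hardness = [1.0 - p for p in p_correct]

# Step 3: rank samples from easiest to hardest and keep the n_desired easiest.
order = sorted(range(len(p_correct)), key=lambda i: hardness[i])
kept = sorted(order[:n_desired])
removed = sorted(order[n_desired:])

print("kept indices:   ", kept)     # [0, 2, 5]
print("removed indices:", removed)  # [1, 3, 4]
```

The three retained samples are the ones with the highest probability of correct classification (0.95, 0.90, 0.80), which is the same set a percentile threshold would select when there are no ties.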

Pseudo-code:

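# Abstract Instance Hardness Thresholding algorithm (NOT the real implementation)
import numpy as np
from sklearn.model_selection import cross_val_predict

# Assumes X and y are NumPy arrays, y holds integer labels 0..K-1 (so it can
# index the probability columns), and estimator, k and sampling_strategy
# (a dict mapping class -> number of samples to retain) are already defined.

# Step 1: Cross-validated probability estimation
probabilities = cross_val_predict(estimator, X, y, cv=k, method="predict_proba")
# Extract probability of correct class for each sample
p_correct = probabilities[np.arange(len(y)), y]

# Steps 2 and 3: for each class to under-sample
keep = np.ones(len(y), dtype=bool)  # classes not listed are kept in full
for target_class, n_desired in sampling_strategy.items():
    in_class = y == target_class
    n_current = np.count_nonzero(in_class)
    # Compute threshold as percentile
    threshold = np.percentile(p_correct[in_class], (1.0 - n_desired / n_current) * 100)
    # Keep samples at or above the threshold (easiest to classify)
    keep[in_class] = p_correct[in_class] >= threshold

X_resampled, y_resampled = X[keep], y[keep]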
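
Step 1 depends on out-of-fold predictions. The following sketch, which assumes scikit-learn and a synthetic noisy dataset chosen only for illustration, contrasts in-sample probabilities with cross-validated ones to show why the latter are used for hardness estimation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data with 10% label noise (illustrative only).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.9, 0.1], flip_y=0.1, random_state=0,
)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
rows = np.arange(len(y))

# In-sample: the forest has already seen every sample it scores.
p_in = clf.fit(X, y).predict_proba(X)[rows, y]

# Out-of-fold: each sample is scored by a model trained without it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
p_oof = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[rows, y]

print("mean in-sample p(correct):  ", round(float(p_in.mean()), 3))
print("mean out-of-fold p(correct):", round(float(p_oof.mean()), 3))
```

In-sample probabilities are typically optimistic because the model has memorized part of the training data, so noisy samples can look easy. The out-of-fold values are the ones that reflect how hard each sample is for a model that has not seen it.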

Key Properties

  • Classifier-dependent: The set of removed samples depends on the choice of estimator. Different classifiers may identify different samples as hard to classify.
  • Cross-validation: Using k-fold cross-validation ensures that hardness estimates are not biased by overfitting, since each sample's probability is predicted by a model that did not train on it.
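  • Percentile-based threshold: The threshold is computed automatically from the desired number of samples, making it compatible with any sampling strategy. When several samples share the threshold probability, all of them are kept, so slightly more samples than requested may be retained.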
  • Multi-class support: The algorithm operates independently on each class targeted for under-sampling, retaining the observations with the highest probability of being correctly classified.
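
The percentile-based threshold can be illustrated with a short NumPy sketch. The probabilities below are made-up values chosen so that several samples tie at the cut-off; the example shows how ties affect the number of retained samples.

```python
import numpy as np

# Made-up probabilities of correct classification for one class,
# with a three-way tie at 0.8 (illustrative only).
p_correct = np.array([0.9, 0.8, 0.8, 0.8, 0.3, 0.2])
n_desired = 2
n_current = len(p_correct)

# Same percentile rule as in the pseudo-code above.
threshold = np.percentile(p_correct, (1.0 - n_desired / n_current) * 100)
keep = p_correct >= threshold
n_retained = int(keep.sum())

print("threshold:", threshold)
print("retained:", n_retained, "requested:", n_desired)
```

Here the threshold falls on the tied value 0.8, so four samples pass the comparison even though two were requested. With distinct probabilities the same rule retains exactly the requested number.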
