Principle:Scikit learn contrib Imbalanced learn Instance Hardness Cross Validation
Principle: Instance Hardness Cross-Validation
Instance Hardness Cross-Validation is a cross-validation (CV) strategy that accounts for the varying difficulty of individual samples. Standard CV methods (e.g., stratified k-fold) preserve class proportions across folds but ignore the fact that some samples are inherently harder to classify than others. As a result, folds can differ in how difficult they are, which increases the variance of per-fold performance metrics.
Problem Statement
In standard stratified cross-validation, samples are split such that each fold has roughly the same class distribution. However, two folds with identical class distributions can still differ dramatically in difficulty:
- Fold A might contain mostly samples near the decision boundary (hard samples).
- Fold B might contain mostly samples far from the decision boundary (easy samples).
This leads to high variance in per-fold test scores, which makes it harder to reliably compare models or estimate generalization performance.
The Instance Hardness Approach
The core idea is to measure the instance hardness of each sample and then distribute hard samples uniformly across folds.
Step 1: Estimate Instance Hardness
Instance hardness is estimated from cross-validated predicted probabilities. A classifier is trained using stratified cross-validation, and for each sample the out-of-fold predicted probability of the positive class is recorded. Samples whose predicted probability is near 0.5 (the decision boundary for binary classification), or on the wrong side of it for their true class, are considered hard; samples whose probability is close to 0 for the negative class or close to 1 for the positive class are considered easy.
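A minimal sketch of this step, assuming scikit-learn is available. The synthetic dataset, the choice of LogisticRegression, and the explicit `hardness` score (one minus the probability of the true class) are illustrative assumptions, not part of the method description above, which only relies on the positive-class probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, imbalanced binary problem (illustrative data only).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.8, 0.2], random_state=0
)

# Out-of-fold probabilities: every sample is scored by a model
# that did not see it during training.
cv = StratifiedKFold(n_splits=5)
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=cv, method="predict_proba"
)

pos_label = 1  # assumed: column index of the positive class in `proba`
y_proba = proba[:, pos_label]

# Assumed convenience score: 1 - P(true class). High values mean hard samples.
hardness = np.where(y == pos_label, 1.0 - y_proba, y_proba)
```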
Step 2: Sort Samples by Class and Hardness
All samples are sorted using a lexicographic sort on two keys:
- Class label (primary sort key)
- Predicted probability of the positive class (secondary sort key)
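This groups samples by class and orders them within each class along the difficulty axis: negative samples run from easy to hard as the probability increases, and positive samples run from hard to easy.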
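The two-key sort can be illustrated on a toy example. Note that `np.lexsort` treats the last key in the tuple as the primary one, so the class label is passed last. The arrays `y` and `p` below are made-up values for illustration.

```python
import numpy as np

y = np.array([1, 0, 1, 0, 0, 1])               # class labels
p = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.6])   # positive-class probabilities

# Primary key: y (last in the tuple). Secondary key: p.
order = np.lexsort((p, y))

print(order)     # [4 1 3 2 5 0]
print(y[order])  # [0 0 0 1 1 1]
print(p[order])  # [0.1 0.2 0.7 0.4 0.6 0.9]
```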
Step 3: Round-Robin Fold Assignment
Fold indices are assigned in a round-robin fashion over the sorted samples:
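import numpy as np
# lexsort uses the last key as primary: sort by class, then by positive-class probability
sorted_indices = np.lexsort((y_proba[:, pos_label], y))
groups = np.empty(n_samples, dtype=int)
# deal out fold indices 0..n_splits-1 in turn along the sorted order
groups[sorted_indices] = np.arange(n_samples) % n_splits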
Because the samples are sorted by difficulty, this round-robin assignment ensures that each fold receives an approximately equal share of easy, medium, and hard samples from each class.
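A small worked example of the round-robin assignment, using made-up labels and probabilities and `n_splits = 3`. Here `p` is already a 1-D vector of positive-class probabilities, an assumption made to keep the example short.

```python
import numpy as np

n_splits = 3
y = np.array([1, 0, 1, 0, 0, 1])
p = np.array([0.9, 0.2, 0.4, 0.7, 0.1, 0.6])

order = np.lexsort((p, y))           # sorted by class, then by probability
groups = np.empty(len(y), dtype=int)
groups[order] = np.arange(len(y)) % n_splits

print(groups)  # [2 1 0 2 0 1]

# Each fold ends up with one sample of each class.
for k in range(n_splits):
    members = np.flatnonzero(groups == k)
    print(k, members, y[members])
```

With these values, fold 0 holds samples 2 and 4, fold 1 holds samples 1 and 5, and fold 2 holds samples 0 and 3, so every fold contains one negative and one positive sample.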
Step 4: Split Using Group-Based CV
The assigned fold indices are treated as group labels, and LeaveOneGroupOut is used to generate the actual train/test splits. Each fold serves as the test set exactly once.
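A minimal sketch of this step with scikit-learn's `LeaveOneGroupOut`, reusing a hand-written `groups` array as the fold assignment. The feature matrix `X` is placeholder data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

X = np.arange(12).reshape(6, 2)          # placeholder features
y = np.array([1, 0, 1, 0, 0, 1])
groups = np.array([2, 1, 0, 2, 0, 1])    # fold index per sample

# One split per distinct group; that group is held out as the test set.
splits = list(LeaveOneGroupOut().split(X, y, groups))

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```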
Key Properties
- Balanced difficulty: Each fold contains a similar proportion of easy and hard samples, reducing per-fold score variance.
- Class preservation: Because the primary sort key is the class label, class proportions are approximately preserved across folds (similar to stratified CV).
- Supervised estimation: The hardness estimation requires a classifier, making this a supervised CV strategy. The choice of estimator can influence the fold assignments.
- Binary classification: The current formulation is designed for binary classification, where instance hardness is naturally captured by the predicted probability of the positive class.
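The class-preservation and balanced-difficulty properties can be inspected empirically. The sketch below builds the fold assignment on synthetic data and tabulates, per fold, the class counts and the mean positive-class probability within each class. The dataset and estimator are assumptions chosen for illustration, and the printed values depend on them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

n_splits = 5
X, y = make_classification(n_samples=203, weights=[0.8, 0.2], random_state=0)

# An integer cv with a classifier uses stratified k-fold internally.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=n_splits, method="predict_proba"
)
y_proba = proba[:, 1]

order = np.lexsort((y_proba, y))
groups = np.empty(len(y), dtype=int)
groups[order] = np.arange(len(y)) % n_splits

# Rows are folds; columns are class 0 and class 1.
class_counts = np.array(
    [np.bincount(y[groups == k], minlength=2) for k in range(n_splits)]
)
mean_proba = np.array(
    [[y_proba[(groups == k) & (y == c)].mean() for c in (0, 1)]
     for k in range(n_splits)]
)

print(class_counts)
print(mean_proba.round(3))
```

Because the round-robin runs over a list sorted by class, the per-class counts of any two folds differ by at most one sample.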
Trade-offs
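- Computational cost: The method requires an internal cross-validation pass to estimate instance hardness, which adds one model fit per internal fold. When the same estimator and the same number of folds are used for the outer evaluation, this roughly doubles the number of model fits.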
- Estimator dependency: The resulting folds depend on the classifier used to estimate hardness. A poor estimator may produce unreliable hardness estimates.
- Data leakage considerations: The internal CV for hardness estimation is stratified and uses proper train/test separation, so there is no direct leakage. However, the fold assignments are computed from the labels and out-of-fold predictions of all samples, including those that later serve as test samples, so they depend on the targets in a more complex way than in standard stratified CV.
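Putting the four steps together makes the cost and the estimator dependency concrete. The sketch below is not the library's implementation: `instance_hardness_splits` is a hypothetical helper written only with NumPy and scikit-learn, and `pos_label` is assumed to be the column index of the positive class in the `predict_proba` output.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    LeaveOneGroupOut,
    cross_val_predict,
    cross_val_score,
)


def instance_hardness_splits(estimator, X, y, n_splits=5, pos_label=1):
    """Hypothetical helper: yield (train_idx, test_idx) pairs per Steps 1-4."""
    # Step 1: internal CV pass (n_splits extra fits of `estimator`).
    proba = cross_val_predict(
        estimator, X, y, cv=n_splits, method="predict_proba"
    )
    # Step 2: sort by class, then by positive-class probability.
    order = np.lexsort((proba[:, pos_label], y))
    # Step 3: round-robin fold assignment along the sorted order.
    groups = np.empty(len(y), dtype=int)
    groups[order] = np.arange(len(y)) % n_splits
    # Step 4: each fold index is held out once.
    yield from LeaveOneGroupOut().split(X, y, groups)


X, y = make_classification(n_samples=300, weights=[0.85, 0.15], random_state=0)
clf = LogisticRegression(max_iter=1000)

splits = list(instance_hardness_splits(clf, X, y, n_splits=5))
# cross_val_score accepts an iterable of (train, test) index pairs as `cv`.
scores = cross_val_score(clf, X, y, cv=splits, scoring="balanced_accuracy")
print(scores.mean(), scores.std())
```

Passing a different estimator to the helper changes the out-of-fold probabilities and therefore the folds, which is the estimator dependency noted above.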