Principle:Cleanlab Cleanlab Regression Noise Robust Training
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Quality, Regression, Noise-Robust Learning |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Noise-robust regression training detects examples with corrupted target values and removes them from the dataset, so that a regression model can be trained as if the labels were correct. It relies on residual analysis, uncertainty estimation, and iterative data pruning.
Description
In real-world regression tasks, target values (y) are often corrupted by noise: measurement errors, data entry mistakes, or systematic annotation biases. Training a regression model directly on such noisy data can lead to degraded performance and unreliable predictions. Noise-robust regression training addresses this by identifying which examples have corrupted targets and excluding them before final model training.
The approach combines several techniques:
- Cross-validated residual analysis: Out-of-fold predictions are obtained via K-fold cross-validation. The residual (prediction minus given label) for each example serves as an initial signal of label corruption, since examples whose residuals are large in magnitude are more likely to have incorrect labels.
- Optimal pruning fraction search: Rather than using a fixed threshold, the algorithm searches for the optimal fraction k of data to exclude by evaluating model performance (R-squared) across multiple values of k using a coarse-then-fine grid search. This data-driven approach avoids arbitrary threshold selection.
- Uncertainty-adjusted scoring: Raw residuals are divided by estimated uncertainty to reduce false positives. An example with a large residual but high uncertainty (where the model is inherently less certain) is scored as less suspicious than an example with an equally large residual and low uncertainty.
- Clean data training: After flagging examples with issues, the final model is trained only on the remaining clean subset, producing predictions as if the model had access to correctly labeled data.
Usage
Use noise-robust regression training when:
- Your regression dataset may contain corrupted target values (noisy labels).
- You want to automatically detect which examples have erroneous targets.
- You want to train a model that performs as if the training data were clean.
- You want per-example label quality scores for data auditing or re-annotation prioritization.
This approach is model-agnostic and works with any scikit-learn-compatible regression estimator.
Theoretical Basis
1. Cross-Validated Residual Computation:
Using K-fold cross-validation, out-of-fold predictions are obtained for each example:
residual[i] = prediction[i] - y[i]
Examples with correct labels should have small residuals, while corrupted labels produce larger residuals.
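A minimal sketch of this step, assuming scikit-learn's cross_val_predict as the source of out-of-fold predictions and LinearRegression as a stand-in estimator. The synthetic data and the names corrupted and most_suspicious are illustrative and not part of the method itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data (illustrative only): y = 2x plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Corrupt three targets so there is something to detect.
corrupted = np.array([5, 50, 120])
y[corrupted] += 10.0

# Each prediction comes from a model that never saw that example during fitting.
predictions = cross_val_predict(LinearRegression(), X, y, cv=5)
residual = predictions - y

# Rank examples by residual magnitude; the corrupted ones should come first.
most_suspicious = np.argsort(-np.abs(residual))[:3]
```

Using out-of-fold rather than in-sample predictions matters here: a flexible model fitted on a corrupted example can partly memorize it, which would shrink exactly the residual this step relies on.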
2. Optimal Pruning Fraction (k):
The fraction k of data to exclude is selected by maximizing the R-squared score:
k* = argmax_k R2(y, predictions_k)
where predictions_k are obtained by training on the data with the fraction k of examples having the largest absolute residuals excluded. A coarse search (e.g., k in {0.01, 0.05, 0.1, 0.15, 0.2}) identifies the neighborhood of the optimal k, followed by a fine-grained search within that neighborhood. If the initial R-squared (k=0, no exclusion) is best, no data is pruned.
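The coarse-then-fine search could be sketched as follows. The helper names r2_after_pruning and search_pruning_fraction, the fine_size parameter, and the choice to score R-squared with out-of-fold predictions on the retained examples are all assumptions made for illustration; the text above does not specify the evaluation protocol.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict


def r2_after_pruning(model, X, y, abs_residual, k, cv=5):
    """Exclude the fraction k with the largest |residual|, then return the
    out-of-fold R^2 on the remaining examples (an assumed protocol)."""
    n_keep = len(y) - int(np.ceil(len(y) * k))
    keep = np.sort(np.argsort(abs_residual)[:n_keep])  # keep original row order
    preds = cross_val_predict(model, X[keep], y[keep], cv=cv)
    return r2_score(y[keep], preds)


def search_pruning_fraction(model, X, y, abs_residual,
                            coarse=(0.01, 0.05, 0.1, 0.15, 0.2), fine_size=3):
    # k = 0 (no pruning) is the baseline that any candidate must beat.
    scores = {0.0: r2_after_pruning(model, X, y, abs_residual, 0.0)}
    for k in coarse:
        scores[k] = r2_after_pruning(model, X, y, abs_residual, k)
    best = max(scores, key=scores.get)
    if best == 0.0:
        return 0.0, scores

    # Fine search: evenly spaced candidates between the coarse neighbours of best.
    grid = (0.0,) + tuple(coarse)
    i = grid.index(best)
    lo, hi = grid[i - 1], grid[min(i + 1, len(grid) - 1)]
    for k in np.linspace(lo, hi, fine_size + 2)[1:-1]:
        k = float(k)
        if k not in scores:
            scores[k] = r2_after_pruning(model, X, y, abs_residual, k)
    return max(scores, key=scores.get), scores


# Synthetic data (illustrative only): 10% of the targets receive heavy noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
noisy = rng.choice(200, size=20, replace=False)
y[noisy] += rng.normal(scale=5.0, size=20)

model = LinearRegression()
abs_residual = np.abs(cross_val_predict(model, X, y, cv=5) - y)
best_k, scores = search_pruning_fraction(model, X, y, abs_residual)
```

Because this sketch scores only the retained examples, R-squared can keep rising as k grows, so the selected k may sit at the upper end of the grid. A different evaluation set would change that behavior.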
3. Uncertainty Estimation:
Total uncertainty for each example combines two components:
uncertainty[i] = epistemic_uncertainty[i] + aleatoric_uncertainty
Epistemic uncertainty (model uncertainty) is estimated via bootstrap resampling. Multiple copies of the model are trained on bootstrapped data, and the standard deviation of their predictions for each example measures how uncertain the model is about that example:
epistemic_uncertainty[i] = sqrt(Var(bootstrap_predictions[i]))
Aleatoric uncertainty (data noise) is estimated by predicting the residuals themselves via cross-validation; the standard deviation of these residual predictions, a single value shared by all examples, captures inherent data noise:
aleatoric_uncertainty = sqrt(Var(residual_predictions))
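A hedged sketch of both components, following the formulas above. The function names, the n_boot and seed parameters, and the use of the same estimator class to predict residuals are assumptions for illustration. Bootstrap models here predict on the full dataset, which may differ from how a given library evaluates them.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict


def epistemic_uncertainty(model, X, y, n_boot=5, seed=0):
    """Per-example standard deviation of predictions across bootstrap-trained models."""
    rng = np.random.default_rng(seed)
    n = len(y)
    boot_predictions = np.empty((n_boot, n))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample n rows with replacement
        boot_predictions[b] = clone(model).fit(X[idx], y[idx]).predict(X)
    return np.sqrt(np.var(boot_predictions, axis=0))


def aleatoric_uncertainty(model, X, residual, cv=5):
    """Single value: standard deviation of cross-validated predictions of the residuals."""
    residual_predictions = cross_val_predict(clone(model), X, residual, cv=cv)
    return float(np.sqrt(np.var(residual_predictions)))


# Synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)

model = LinearRegression()
residual = cross_val_predict(model, X, y, cv=5) - y
epistemic = epistemic_uncertainty(model, X, y)
aleatoric = aleatoric_uncertainty(model, X, residual)

# The scalar aleatoric term is broadcast onto the per-example epistemic term.
uncertainty = epistemic + aleatoric
```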
4. Label Quality Scoring:
Residuals are adjusted by uncertainty and normalized:
adjusted_residual[i] = |residual[i]| / (uncertainty[i] + epsilon)
label_quality[i] = exp(-adjusted_residual[i] / median(adjusted_residual))
This produces scores in (0, 1], where values close to 1 indicate likely correct labels and values close to 0 indicate likely corrupted labels. The exponential decay ensures a smooth scoring function, and normalization by the median makes scores interpretable across datasets.
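The two formulas translate directly into NumPy. In this small worked example the function name label_quality_scores, the default epsilon, and the input values are illustrative assumptions. The last two examples share the same residual but differ in uncertainty, to show the effect of the adjustment.

```python
import numpy as np


def label_quality_scores(residual, uncertainty, epsilon=1e-8):
    """Uncertainty-adjusted residuals mapped to (0, 1] by exponential decay."""
    adjusted = np.abs(residual) / (uncertainty + epsilon)
    return np.exp(-adjusted / np.median(adjusted))


residual = np.array([0.1, -0.2, 0.1, 5.0, 5.0])
uncertainty = np.array([0.5, 0.5, 0.5, 0.5, 5.0])
scores = label_quality_scores(residual, uncertainty)

# Index 3: large residual, low uncertainty  -> score near 0.
# Index 4: same residual, high uncertainty  -> noticeably higher score.
# Index 1: its adjusted residual is the median, so its score is exp(-1).
```

One consequence of the median normalization is that the example sitting at the median adjusted residual always scores exp(-1), roughly 0.37, regardless of the dataset's scale.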
5. Issue Flagging and Clean Training:
The top ceil(N * k) examples with the lowest label quality scores are flagged as having label issues. The final model is then trained exclusively on the remaining clean examples:
X_clean, y_clean = X[~is_issue], y[~is_issue]
model.fit(X_clean, y_clean)
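A small end-to-end sketch of flagging and clean training, assuming label quality scores and k are already available. The helper flag_label_issues, the hand-written scores, and the toy data are hypothetical and only serve to make the step concrete.

```python
import math
import numpy as np
from sklearn.linear_model import LinearRegression


def flag_label_issues(label_quality, k):
    """Mark the ceil(N * k) examples with the lowest quality scores."""
    n = len(label_quality)
    n_issues = math.ceil(n * k)
    is_issue = np.zeros(n, dtype=bool)
    is_issue[np.argsort(label_quality)[:n_issues]] = True
    return is_issue


# Toy data (illustrative only): y = 2x + 1 with two corrupted targets.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
y[3] += 20.0
y[7] -= 15.0

# Hand-written scores standing in for the output of the scoring step.
label_quality = np.ones(10)
label_quality[[3, 7]] = 0.01

is_issue = flag_label_issues(label_quality, k=0.2)
X_clean, y_clean = X[~is_issue], y[~is_issue]
model = LinearRegression().fit(X_clean, y_clean)
```

With the two corrupted rows excluded, the fitted line recovers the slope and intercept of the uncorrupted relationship in this toy case.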