Principle: Cleanlab Cross-Validation Predicted Probabilities
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
Technique for obtaining unbiased model predictions on training data using k-fold cross-validation.
Description
Cross-validation predicted probabilities (also called out-of-sample predictions) solve the overfitting problem where a model's predictions on its own training data are unrealistically confident. When a classifier is trained and evaluated on the same data, it tends to assign near-perfect probabilities to training examples, making it impossible to distinguish genuinely well-labeled examples from mislabeled ones. By training on K-1 folds and predicting on the held-out fold, each example receives a prediction from a model that never saw it during training. This produces realistic, calibrated probability estimates that faithfully reflect the model's true uncertainty about each example's class membership. These out-of-sample predicted probabilities are the foundation for all confident learning methods in cleanlab, serving as the primary input to label issue detection, quality scoring, and dataset health analysis.
Usage
Use when you need out-of-sample predicted probabilities for the entire training dataset. This is a prerequisite for all cleanlab label issue detection methods. If you already have out-of-sample predicted probabilities (e.g., from a separate validation procedure or a pre-trained model evaluated on held-out data), you can skip this step and provide them directly. Otherwise, this cross-validation procedure is the standard way to obtain them.
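A minimal sketch of obtaining these out-of-sample predicted probabilities with scikit-learn's `cross_val_predict`; the logistic regression classifier and synthetic dataset here are illustrative assumptions, not requirements:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Illustrative synthetic dataset; substitute your own features and labels.
X, y = make_classification(
    n_samples=500, n_classes=3, n_informative=5, random_state=0
)

# method="predict_proba" trains on K-1 folds and predicts on the held-out
# fold, so every row of pred_probs is an out-of-sample estimate.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

print(pred_probs.shape)  # one probability vector per training example
```

For classification targets, `cross_val_predict` uses stratified folds by default, so each fold preserves the overall class distribution. The resulting `pred_probs` array can then be passed directly to cleanlab's label issue detection functions.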
Theoretical Basis
For K folds, partition the dataset D of N examples into K disjoint subsets D_1, D_2, ..., D_K of approximately equal size.
For each fold k in 1, ..., K:
- Train the classifier f_k on all data except fold k: D \ D_k
- For each example (x_i, y_i) in fold D_k, compute the predicted class probabilities: P_k(label = c | x_i) for all classes c
Concatenate predictions across all folds to assemble the full out-of-sample probability matrix:
pred_probs[i, c] = P_k(label = c | x_i) where x_i is in fold D_k
The resulting matrix pred_probs has shape (N, C), where N is the number of examples and C is the number of classes (K already denotes the number of folds). Each row sums to 1 and represents a valid probability distribution. Crucially, pred_probs[i] was produced by a model that never observed example i during training, ensuring unbiased estimates.
Stratified splitting is used so that each fold preserves the class distribution of the original dataset, preventing degenerate folds with missing classes.
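The fold-by-fold procedure above can be sketched explicitly with scikit-learn's `StratifiedKFold`; the classifier and dataset are illustrative assumptions:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Illustrative synthetic dataset; substitute your own features and labels.
X, y = make_classification(
    n_samples=300, n_classes=3, n_informative=5, random_state=0
)

K = 5
n_classes = len(np.unique(y))
pred_probs = np.zeros((len(y), n_classes))

# Stratified splitting preserves the class distribution in every fold,
# preventing degenerate folds with missing classes.
skf = StratifiedKFold(n_splits=K, shuffle=True, random_state=0)
for train_idx, holdout_idx in skf.split(X, y):
    model = clone(LogisticRegression(max_iter=1000))
    model.fit(X[train_idx], y[train_idx])  # train f_k on D \ D_k
    # predict on the held-out fold D_k and place results in the full matrix
    pred_probs[holdout_idx] = model.predict_proba(X[holdout_idx])

# Every row is a valid probability distribution produced by a model
# that never saw that example during training.
assert np.allclose(pred_probs.sum(axis=1), 1.0)
```

Since each example appears in exactly one held-out fold, concatenating the per-fold predictions by index fills every row of the (N, C) matrix exactly once.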