Implementation:Cleanlab Cleanlab Estimate CV Predicted Probabilities
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
A concrete tool, provided by the cleanlab library, for computing out-of-sample predicted probabilities via stratified k-fold cross-validation.
Description
This function performs stratified k-fold cross-validation to produce out-of-sample predicted probabilities for every example in the training dataset. For each fold, it clones the provided sklearn-compatible classifier, trains the clone on the remaining folds, and calls the trained model's predict_proba method on the held-out fold. Each fold's predictions are written back to the rows of the examples they belong to, so the result is a single (N, K) matrix aligned with the original data order. An optional validation_func receives each fold's held-out data and returns extra keyword arguments for clf.fit(), which is useful when training needs a validation set (for example, for early stopping). Stratified splitting preserves class proportions in every fold, and seed makes the fold assignment reproducible.
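The per-fold procedure described above can be sketched with scikit-learn alone. This is a minimal illustration, not cleanlab's source: the helper name cv_pred_probs_sketch and the synthetic data are assumptions introduced here, and the real function additionally handles clf_kwargs and validation_func.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def cv_pred_probs_sketch(X, labels, clf, cv_n_folds=5, seed=None):
    """Hypothetical helper: out-of-sample probabilities via stratified k-fold."""
    num_classes = len(np.unique(labels))
    pred_probs = np.zeros((len(labels), num_classes))
    kf = StratifiedKFold(n_splits=cv_n_folds, shuffle=True, random_state=seed)
    for train_idx, holdout_idx in kf.split(X, labels):
        clf_copy = clone(clf)  # fresh, unfitted copy for this fold
        clf_copy.fit(X[train_idx], labels[train_idx])
        # Write held-out predictions back to their original row positions.
        pred_probs[holdout_idx] = clf_copy.predict_proba(X[holdout_idx])
    return pred_probs


# Illustrative synthetic data; sizes and seeds are arbitrary.
X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
pred_probs = cv_pred_probs_sketch(
    X, labels, LogisticRegression(max_iter=1000), seed=0
)
```

Because every example lands in exactly one held-out fold, each row of pred_probs is filled once, by a model that never saw that example during training.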
Usage
Import and use this function when you have a feature matrix and noisy labels but do not yet have out-of-sample predicted probabilities. This is the standard first step before calling any cleanlab label issue detection function such as find_label_issues, compute_confident_joint, or get_label_quality_scores.
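To illustrate why out-of-sample probabilities matter for the downstream step, the sketch below computes one simple label quality score, self-confidence (the probability the model assigns to each example's given label), on data with deliberately flipped labels. It uses scikit-learn's cross_val_predict as a stand-in for estimate_cv_predicted_probabilities so that it runs without cleanlab. The noise injection, sizes, and variable names are assumptions made for illustration, and cleanlab's own detection functions do more than this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic, well-separated data; all sizes and seeds are illustrative.
X, true_labels = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, class_sep=2.0, random_state=0,
)

# Corrupt 10% of the labels by shifting each to a different class.
rng = np.random.default_rng(0)
flipped = rng.choice(len(true_labels), size=60, replace=False)
labels = true_labels.copy()
labels[flipped] = (labels[flipped] + 1) % 3

# Stand-in for estimate_cv_predicted_probabilities (scikit-learn only).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=cv, method="predict_proba"
)

# Self-confidence: probability assigned to each example's given (noisy) label.
self_confidence = pred_probs[np.arange(len(labels)), labels]

is_flipped = np.zeros(len(labels), dtype=bool)
is_flipped[flipped] = True
mean_flipped = self_confidence[is_flipped].mean()
mean_clean = self_confidence[~is_flipped].mean()
print(f"mean self-confidence, flipped labels: {mean_flipped:.2f}")
print(f"mean self-confidence, clean labels:   {mean_clean:.2f}")
```

If the probabilities were in-sample instead, a flexible model could memorize the flipped labels and assign them high confidence, which would hide exactly the examples being looked for.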
Code Reference
Source Location
- Repository: cleanlab
- File: cleanlab/count.py
- Lines: 1180-1243
Signature
def estimate_cv_predicted_probabilities(
    X,
    labels,
    clf=LogisticRegression(multi_class="auto", solver="lbfgs", max_iter=1000),
    *,
    cv_n_folds=5,
    seed=None,
    clf_kwargs={},
    validation_func=None,
) -> np.ndarray
Import
from cleanlab.count import estimate_cv_predicted_probabilities
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| X | np.ndarray | Yes | Feature matrix of shape (N, M) where N is the number of examples and M is the number of features. |
| labels | np.ndarray | Yes | Array of noisy class labels of shape (N,) with integer values in range 0..K-1. |
| clf | sklearn-compatible classifier | No | Any classifier implementing fit() and predict_proba(). Defaults to LogisticRegression with lbfgs solver and max_iter=1000. |
| cv_n_folds | int | No | Number of cross-validation folds. Defaults to 5. |
| seed | Optional[int] | No | Random seed for reproducible fold splits. Defaults to None. |
| clf_kwargs | dict | No | Additional keyword arguments passed to clf.fit(). Defaults to empty dict. |
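| validation_func | Optional[Callable] | No | A callable that takes the held-out fold's features and labels (X_val, y_val) and returns a dict of extra keyword arguments passed to clf.fit() for that fold, for example a validation set for early stopping. Defaults to None. |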
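The sketch below shows the calling convention for validation_func under the assumption that it receives the held-out features and labels and returns a dict of keyword arguments for fit(). The classifier EvalSetLogReg, its eval_set keyword, and the function fit_kwargs_from_validation are hypothetical names invented for this example. The per-fold call is emulated on a single stratified split so the block runs without cleanlab.

```python
from sklearn.base import BaseEstimator
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


class EvalSetLogReg(BaseEstimator):
    """Hypothetical classifier whose fit() accepts an eval_set keyword."""

    def __init__(self, max_iter=1000):
        self.max_iter = max_iter

    def fit(self, X, y, eval_set=None):
        self.model_ = LogisticRegression(max_iter=self.max_iter).fit(X, y)
        self.classes_ = self.model_.classes_
        if eval_set is not None:
            # A real model might use this set for early stopping;
            # here it only records validation accuracy.
            X_val, y_val = eval_set
            self.val_accuracy_ = self.model_.score(X_val, y_val)
        return self

    def predict_proba(self, X):
        return self.model_.predict_proba(X)


def fit_kwargs_from_validation(X_val, y_val):
    """Hypothetical validation_func: maps held-out data to extra fit() kwargs."""
    return {"eval_set": (X_val, y_val)}


X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Emulate one fold: train on 80%, hold out 20%.
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.2, stratify=labels, random_state=0
)
clf = EvalSetLogReg()
clf.fit(X_train, y_train, **fit_kwargs_from_validation(X_val, y_val))
val_probs = clf.predict_proba(X_val)
```

With cleanlab installed, the same callable would be supplied as validation_func=fit_kwargs_from_validation alongside clf=EvalSetLogReg().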
Outputs
| Name | Type | Description |
|---|---|---|
| pred_probs | np.ndarray | Array of shape (N, K) containing out-of-sample predicted probabilities. Each row sums to 1 and represents the model's predicted class distribution for that example, produced by a model that did not train on that example. |
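The output contract above can be checked before the matrix is handed to a downstream function. The validator below is a hypothetical helper written for this page, not part of cleanlab, and the small example arrays are made up for illustration.

```python
import numpy as np


def check_pred_probs(pred_probs, labels, atol=1e-6):
    """Hypothetical validator for an (N, K) predicted-probability matrix."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    if pred_probs.ndim != 2:
        raise ValueError("pred_probs must be 2-D with shape (N, K)")
    n_examples, n_classes = pred_probs.shape
    if n_examples != len(labels):
        raise ValueError("pred_probs and labels must have the same length")
    if labels.min() < 0 or labels.max() >= n_classes:
        raise ValueError("labels must lie in 0..K-1")
    if (pred_probs < 0).any() or (pred_probs > 1).any():
        raise ValueError("probabilities must lie in [0, 1]")
    if not np.allclose(pred_probs.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each row must sum to 1")
    return n_examples, n_classes


# Two examples, three classes.
example_probs = np.array([[0.85, 0.10, 0.05],
                          [0.20, 0.30, 0.50]])
example_labels = np.array([0, 2])
shape = check_pred_probs(example_probs, example_labels)
```

A check like this catches common mistakes early, such as passing raw decision scores instead of probabilities or a matrix with fewer columns than there are classes.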
External Dependencies
- numpy -- array operations and concatenation of fold predictions
- sklearn.linear_model.LogisticRegression -- default classifier
- sklearn.model_selection.StratifiedKFold -- stratified fold splitting
- sklearn.base.clone -- cloning classifier for each fold to ensure independent training
Usage Examples
Basic Usage
import numpy as np
from sklearn.datasets import make_classification
from cleanlab.count import estimate_cv_predicted_probabilities
# Generate synthetic data with 3 classes
X, labels = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=42
)
# Get out-of-sample predicted probabilities using default LogisticRegression
pred_probs = estimate_cv_predicted_probabilities(X, labels, cv_n_folds=5, seed=42)
print(pred_probs.shape) # (500, 3)
print(pred_probs[0]) # e.g., [0.85, 0.10, 0.05]
Using a Custom Classifier
from sklearn.ensemble import GradientBoostingClassifier
from cleanlab.count import estimate_cv_predicted_probabilities
# Reuses X and labels from the Basic Usage example above
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
# The classifier is cloned for each of the 10 folds, so clf itself is left unfitted
pred_probs = estimate_cv_predicted_probabilities(
    X, labels, clf=clf, cv_n_folds=10, seed=42
)