Implementation:Cleanlab Cleanlab Estimate CV Predicted Probabilities

Knowledge Sources	Cleanlab Cleanlab Docs
Domains	Machine_Learning, Data_Quality
Last Updated	2026-02-09 19:00 GMT

Overview

estimate_cv_predicted_probabilities is the Cleanlab function for computing out-of-sample predicted probabilities via stratified k-fold cross-validation.

Description

This function performs stratified k-fold cross-validation to produce out-of-sample predicted probabilities for every example in the training dataset. For each fold, it clones the provided sklearn-compatible classifier, trains the clone on the remaining folds, and calls its predict_proba method on the held-out fold. The per-fold predictions are assembled into a single (N, K) matrix whose rows follow the original data order, so every row comes from a model that never saw that example during training. Stratified splitting preserves the class distribution in each fold, and seed makes the splits reproducible. An optional validation_func supports classifiers that can use validation data during fitting (e.g., for early stopping): per cleanlab's documentation, it receives the held-out fold's features and labels and returns extra keyword arguments for clf.fit().
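
The procedure described above can be sketched with scikit-learn alone. This is an illustrative reimplementation, not cleanlab's source: the helper name manual_cv_pred_probs is hypothetical, the synthetic data is made up for the example, and details such as shuffling inside StratifiedKFold are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def manual_cv_pred_probs(X, labels, clf, cv_n_folds=5, seed=None):
    """Hypothetical helper: out-of-sample pred_probs via stratified k-fold."""
    num_classes = len(np.unique(labels))
    pred_probs = np.zeros((len(labels), num_classes))
    # Assumption: folds are shuffled and seeded for reproducibility.
    splitter = StratifiedKFold(n_splits=cv_n_folds, shuffle=True, random_state=seed)
    for train_idx, holdout_idx in splitter.split(X, labels):
        fold_clf = clone(clf)  # fresh, unfitted copy for this fold
        fold_clf.fit(X[train_idx], labels[train_idx])
        # Write held-out predictions back into their original row positions.
        pred_probs[holdout_idx] = fold_clf.predict_proba(X[holdout_idx])
    return pred_probs


# Synthetic 3-class data, for illustration only.
X, labels = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
pred_probs = manual_cv_pred_probs(X, labels, LogisticRegression(max_iter=1000), seed=0)
print(pred_probs.shape)  # (300, 3)
```

Because each row is filled in by index rather than appended, the result stays aligned with the input order regardless of how the folds were shuffled.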

Usage

Import and use this function when you have a feature matrix and noisy labels but do not yet have out-of-sample predicted probabilities. This is the standard first step before calling cleanlab functions that consume pred_probs, such as find_label_issues, compute_confident_joint, or get_label_quality_scores.

Code Reference

Source Location

Repository: cleanlab
File: cleanlab/count.py
Lines: 1180-1243

Signature

def estimate_cv_predicted_probabilities(
    X,
    labels,
    clf=LogisticRegression(multi_class="auto", solver="lbfgs", max_iter=1000),
    *,
    cv_n_folds=5,
    seed=None,
    clf_kwargs={},
    validation_func=None,
) -> np.ndarray

Import

from cleanlab.count import estimate_cv_predicted_probabilities

I/O Contract

Inputs

Name	Type	Required	Description
X	np.ndarray	Yes	Feature matrix of shape (N, M) where N is the number of examples and M is the number of features.
labels	np.ndarray	Yes	Array of noisy class labels of shape (N,) with integer values in range 0..K-1.
clf	sklearn-compatible classifier	No	Any classifier implementing fit() and predict_proba(). Defaults to LogisticRegression with lbfgs solver and max_iter=1000.
cv_n_folds	int	No	Number of cross-validation folds. Defaults to 5.
seed	Optional[int]	No	Random seed for reproducible fold splits. Defaults to None.
clf_kwargs	dict	No	Additional keyword arguments passed to clf.fit(). Defaults to empty dict.
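validation_func	Optional[Callable]	No	A callable that takes the held-out fold's features and labels (X_val, y_val) and returns a dict of extra keyword arguments for clf.fit(), for classifiers that can use validation data during training (e.g., for early stopping). Defaults to None.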

Outputs

Name	Type	Description
pred_probs	np.ndarray	Array of shape (N, K) containing out-of-sample predicted probabilities. Each row sums to 1 and represents the model's predicted class distribution for that example, produced by a model that did not train on that example.
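
The output contract above can be checked before handing pred_probs to downstream code. The following is a minimal sketch using only NumPy; check_pred_probs is a hypothetical helper written for this page, not part of cleanlab, and the toy arrays are made up.

```python
import numpy as np


def check_pred_probs(pred_probs, labels):
    """Hypothetical helper: basic checks on an (N, K) pred_probs matrix."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    if pred_probs.ndim != 2 or pred_probs.shape[0] != labels.shape[0]:
        raise ValueError("pred_probs must have shape (N, K) with N == len(labels)")
    if labels.min() < 0 or labels.max() >= pred_probs.shape[1]:
        raise ValueError("labels must be integers in 0..K-1")
    if not np.allclose(pred_probs.sum(axis=1), 1.0, atol=1e-6):
        raise ValueError("each row of pred_probs should sum to 1")
    return True


# Toy values, for illustration only.
toy_pred_probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
toy_labels = np.array([0, 1, 2])
print(check_pred_probs(toy_pred_probs, toy_labels))  # True
```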

External Dependencies

numpy -- array operations and concatenation of fold predictions
sklearn.linear_model.LogisticRegression -- default classifier
sklearn.model_selection.StratifiedKFold -- stratified fold splitting
sklearn.base.clone -- cloning classifier for each fold to ensure independent training

Usage Examples

Basic Usage

import numpy as np
from sklearn.datasets import make_classification
from cleanlab.count import estimate_cv_predicted_probabilities

# Generate synthetic data with 3 classes
X, labels = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=42
)

# Get out-of-sample predicted probabilities using default LogisticRegression
pred_probs = estimate_cv_predicted_probabilities(X, labels, cv_n_folds=5, seed=42)

print(pred_probs.shape)  # (500, 3)
print(pred_probs[0])     # e.g., [0.85, 0.10, 0.05]

Using a Custom Classifier

from sklearn.ensemble import GradientBoostingClassifier
from cleanlab.count import estimate_cv_predicted_probabilities

# Reuses X and labels from the Basic Usage example above
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)

pred_probs = estimate_cv_predicted_probabilities(
    X, labels, clf=clf, cv_n_folds=10, seed=42
)
