Implementation:Cleanlab Cleanlab Estimate CV Predicted Probabilities

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Data_Quality
Last Updated 2026-02-09 19:00 GMT

Overview

A concrete tool, provided by the cleanlab library, for computing out-of-sample predicted probabilities via stratified k-fold cross-validation.

Description

This function performs stratified k-fold cross-validation to produce out-of-sample predicted probabilities for every example in the training dataset. For each fold, it clones the provided sklearn-compatible classifier, fits the clone on the remaining folds, and calls the fitted model's predict_proba method on the held-out fold. The per-fold predictions are assembled into a single (N, K) matrix whose rows follow the original data order. An optional validation_func maps each fold's held-out data to extra keyword arguments for clf.fit(), which is useful when the classifier can consume validation data during training (for example, for early stopping). Splits are stratified to preserve class proportions in each fold, and seed makes the fold assignment reproducible.
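The fold loop described above can be sketched with scikit-learn alone. This is an illustrative re-implementation under stated assumptions, not the cleanlab source: the helper name cv_pred_probs_sketch and the demo variables are hypothetical, and it assumes every class appears in every training fold so that predict_proba always returns K columns.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def cv_pred_probs_sketch(X, labels, clf, cv_n_folds=5, seed=None):
    """Illustrative out-of-sample predict_proba via stratified k-fold CV."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    num_classes = len(np.unique(labels))
    pred_probs = np.zeros((len(labels), num_classes))

    # Shuffled, stratified folds; seed controls the fold assignment.
    kf = StratifiedKFold(n_splits=cv_n_folds, shuffle=True, random_state=seed)
    for train_idx, holdout_idx in kf.split(X, labels):
        # A fresh, unfitted copy so no state leaks between folds.
        model = clone(clf)
        model.fit(X[train_idx], labels[train_idx])
        # Write held-out predictions back at their original row positions.
        pred_probs[holdout_idx] = model.predict_proba(X[holdout_idx])
    return pred_probs


# Small synthetic demo (hypothetical data, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
demo_probs = cv_pred_probs_sketch(
    X_demo, y_demo, LogisticRegression(max_iter=1000), cv_n_folds=5, seed=0
)
```

Because each row is predicted by a model that never saw that row during fitting, the probabilities are not inflated by memorization of the (possibly noisy) labels.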

Usage

Import and use this function when you have a feature matrix and noisy labels but do not yet have out-of-sample predicted probabilities. This is the standard first step before calling any cleanlab label issue detection function such as find_label_issues, compute_confident_joint, or get_label_quality_scores.

Code Reference

Source Location

  • Repository: cleanlab
  • File: cleanlab/count.py
  • Lines: 1180-1243

Signature

def estimate_cv_predicted_probabilities(
    X,
    labels,
    clf=LogisticRegression(multi_class="auto", solver="lbfgs", max_iter=1000),
    *,
    cv_n_folds=5,
    seed=None,
    clf_kwargs={},
    validation_func=None,
) -> np.ndarray

Import

from cleanlab.count import estimate_cv_predicted_probabilities

I/O Contract

Inputs

Name | Type | Required | Description
X | np.ndarray | Yes | Feature matrix of shape (N, M), where N is the number of examples and M is the number of features.
labels | np.ndarray | Yes | Array of noisy class labels of shape (N,) with integer values in the range 0..K-1.
clf | sklearn-compatible classifier | No | Any classifier implementing fit() and predict_proba(). Defaults to LogisticRegression with the lbfgs solver and max_iter=1000.
cv_n_folds | int | No | Number of cross-validation folds. Defaults to 5.
seed | Optional[int] | No | Random seed for reproducible fold splits. Defaults to None.
clf_kwargs | dict | No | Additional keyword arguments passed to clf.fit(). Defaults to an empty dict.
validation_func | Optional[Callable] | No | A callable that receives each fold's held-out features and labels and returns a dict of extra keyword arguments for clf.fit(), for classifiers that can use validation data during training. Defaults to None.
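How clf_kwargs and validation_func might reach fit() within a single fold can be sketched as follows. This is a minimal sketch, assuming validation_func takes the held-out features and labels and returns a dict of extra fit() keyword arguments; ValAwareLogReg, fit_one_fold, and the X_val / y_val argument names are hypothetical and exist only for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


class ValAwareLogReg(LogisticRegression):
    """Hypothetical classifier whose fit() accepts validation data."""

    def fit(self, X, y, sample_weight=None, X_val=None, y_val=None):
        super().fit(X, y, sample_weight=sample_weight)
        # Record held-out accuracy when validation data is supplied.
        self.val_accuracy_ = (
            None if X_val is None else float(self.score(X_val, y_val))
        )
        return self


def fit_one_fold(clf, X, labels, train_idx, holdout_idx,
                 clf_kwargs=None, validation_func=None):
    """Fit a clone on one fold, merging both sources of fit() kwargs."""
    clf_kwargs = clf_kwargs or {}
    validation_kwargs = (
        {} if validation_func is None
        else validation_func(X[holdout_idx], labels[holdout_idx])
    )
    model = clone(clf)
    model.fit(X[train_idx], labels[train_idx], **clf_kwargs, **validation_kwargs)
    return model


X_demo, y_demo = make_classification(
    n_samples=150, n_features=6, n_informative=3,
    n_classes=3, n_clusters_per_class=1, random_state=1,
)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
train_idx, holdout_idx = next(splitter.split(X_demo, y_demo))

# Map the held-out split onto the argument names this classifier expects.
fold_model = fit_one_fold(
    ValAwareLogReg(max_iter=1000), X_demo, y_demo, train_idx, holdout_idx,
    validation_func=lambda X_val, y_val: {"X_val": X_val, "y_val": y_val},
)
```

Keeping the mapping in a user-supplied callable lets the same fold loop serve classifiers with different validation-argument names.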

Outputs

Name | Type | Description
pred_probs | np.ndarray | Array of shape (N, K) containing out-of-sample predicted probabilities. Each row sums to 1 and represents the model's predicted class distribution for that example, produced by a model that did not train on that example.
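The output contract can be checked before passing pred_probs downstream. The helper below is a hypothetical sanity check written for this page, not part of cleanlab; it verifies only the shape, label range, and row-sum properties stated above.

```python
import numpy as np


def check_pred_probs(pred_probs, labels, atol=1e-6):
    """Validate an (N, K) probability matrix against integer labels 0..K-1."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    labels = np.asarray(labels)
    if pred_probs.ndim != 2:
        raise ValueError("pred_probs must be 2-dimensional (N, K)")
    n, k = pred_probs.shape
    if n != len(labels):
        raise ValueError("pred_probs and labels must have the same length")
    if labels.min() < 0 or labels.max() >= k:
        raise ValueError("labels must lie in the range 0..K-1")
    if not np.allclose(pred_probs.sum(axis=1), 1.0, atol=atol):
        raise ValueError("each row of pred_probs must sum to 1")
    return n, k


# Tiny hand-written example: 2 examples, 3 classes.
shape = check_pred_probs([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]], [0, 1])
```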

External Dependencies

  • numpy -- array operations and concatenation of fold predictions
  • sklearn.linear_model.LogisticRegression -- default classifier
  • sklearn.model_selection.StratifiedKFold -- stratified fold splitting
  • sklearn.base.clone -- cloning classifier for each fold to ensure independent training

Usage Examples

Basic Usage

import numpy as np
from sklearn.datasets import make_classification
from cleanlab.count import estimate_cv_predicted_probabilities

# Generate synthetic data with 3 classes
X, labels = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    n_classes=3, n_clusters_per_class=1, random_state=42
)

# Get out-of-sample predicted probabilities using default LogisticRegression
pred_probs = estimate_cv_predicted_probabilities(X, labels, cv_n_folds=5, seed=42)

print(pred_probs.shape)  # (500, 3)
print(pred_probs[0])     # e.g., [0.85, 0.10, 0.05]

Using a Custom Classifier

from sklearn.ensemble import GradientBoostingClassifier
from cleanlab.count import estimate_cv_predicted_probabilities

# Reuses X and labels from the Basic Usage example above.
clf = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)

# The unfitted clf is cloned and refit once per fold (10 fits in total).
pred_probs = estimate_cv_predicted_probabilities(
    X, labels, clf=clf, cv_n_folds=10, seed=42
)

Related Pages

Implements Principle
