Implementation:Cleanlab Cleanlab CleanLearning Fit

From Leeroopedia


Field | Value
Sources | Confident Learning, Cleanlab
Domains | Machine_Learning, Data_Quality
Last Updated | 2026-02-09 12:00 GMT

Overview

CleanLearning.fit trains a classifier on data that has been automatically cleaned of detected label issues, producing a model robust to label noise.

Description

The fit method implements the full noise-robust training pipeline. It detects mislabeled examples (or accepts pre-computed label issues), removes them from the training set, and retrains the wrapped classifier on the cleaned data. The method returns self to support sklearn's method chaining convention.

The training pipeline proceeds through the following stages:

  1. Label issue detection or acceptance: If label_issues is not provided, the method calls self.find_label_issues() internally to detect mislabeled examples. If label_issues is provided as a DataFrame (from a previous call to find_label_issues()) or a boolean/integer array, it is used directly.
  2. Data pruning: Examples identified as label issues are removed from X and labels. The indices of removed examples are preserved in self.label_issues_df. If the classifier supports sample_weight, mislabeled examples can alternatively be assigned zero weight.
  3. Sample weight adjustment: If sample_weight is provided, the weights for removed examples are also pruned to maintain alignment between features, labels, and weights.
  4. Final classifier training: The wrapped classifier is fit on the pruned dataset using clf_final_kwargs (which may differ from clf_kwargs used during the cross-validation phase). This allows using different hyperparameters for the final training pass.
  5. State storage: The method stores self.label_issues_df, self.noise_matrix, self.inverse_noise_matrix, self.confident_joint, and self.pred_probs for post-hoc analysis.
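The detection and pruning stages above can be sketched with NumPy alone. This is a simplified stand-in, not cleanlab's implementation: the helper names (`per_class_thresholds`, `flag_label_issues`, `to_issue_mask`, `prune`) are hypothetical, and the flagging rule is a reduced form of confident learning in which an example is flagged when its most probable class above the per-class threshold differs from its given label.

```python
import numpy as np

def per_class_thresholds(labels, pred_probs):
    """Mean predicted probability of class k over examples whose given label is k."""
    num_classes = pred_probs.shape[1]
    return np.array([pred_probs[labels == k, k].mean() for k in range(num_classes)])

def flag_label_issues(labels, pred_probs, thresholds=None):
    """Boolean mask: True where the confidently predicted class disagrees with the given label."""
    if thresholds is None:
        thresholds = per_class_thresholds(labels, pred_probs)
    # Keep only probabilities that clear their class threshold.
    confident = np.where(pred_probs >= thresholds, pred_probs, -np.inf)
    confident_class = confident.argmax(axis=1)
    has_confident_class = np.isfinite(confident.max(axis=1))
    return has_confident_class & (confident_class != labels)

def to_issue_mask(label_issues, num_examples):
    """Accept either a boolean mask or an array of integer indices."""
    arr = np.asarray(label_issues)
    if arr.dtype == bool:
        return arr
    mask = np.zeros(num_examples, dtype=bool)
    mask[arr.astype(int)] = True
    return mask

def prune(X, labels, issue_mask, sample_weight=None):
    """Drop flagged rows from features, labels, and (if given) weights together."""
    keep = ~issue_mask
    weights = None if sample_weight is None else sample_weight[keep]
    return X[keep], labels[keep], weights

# Toy data: 6 examples, 2 classes; example 2 is labeled 0 but predicted as class 1.
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
    [0.2, 0.8], [0.1, 0.9], [0.4, 0.6],
])
X = np.arange(12).reshape(6, 2)
sample_weight = np.ones(6)

issue_mask = flag_label_issues(labels, pred_probs)
X_clean, labels_clean, weights_clean = prune(X, labels, issue_mask, sample_weight)
# The final stage would then call something like clf.fit(X_clean, labels_clean, **clf_final_kwargs).
print(np.flatnonzero(issue_mask))
```

Pruning all three arrays with the same `keep` mask is what keeps features, labels, and weights aligned after removal.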

The y parameter is accepted as an alias for labels to support sklearn pipeline compatibility.
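One way such an alias could be resolved is sketched below. The helper `resolve_labels` is hypothetical, and the error handling is an assumption; cleanlab's own checks and messages may differ.

```python
def resolve_labels(labels=None, y=None):
    """Hypothetical helper: return whichever of `labels` / `y` was supplied."""
    if labels is not None and y is not None:
        raise ValueError("Provide only one of 'labels' or 'y', not both.")
    if labels is None and y is None:
        raise ValueError("Either 'labels' or 'y' must be provided.")
    return labels if labels is not None else y

# A pipeline that calls fit(X, y=...) and a direct call fit(X, labels=...)
# would resolve to the same array under this scheme.
print(resolve_labels(y=[0, 1, 1]))
print(resolve_labels(labels=[0, 1, 1]))
```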

Usage

Call fit on a CleanLearning instance just as you would call fit on any sklearn classifier.

from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)

# Inspect detected issues
print(f"Found {cl.label_issues_df['is_label_issue'].sum()} label issues")
print(f"Trained on {(~cl.label_issues_df['is_label_issue']).sum()} clean examples")

Code Reference

Source Location

Repository: cleanlab/cleanlab
File: cleanlab/classification.py
Lines: 265-582

Signature

def fit(
    self,
    X,
    labels=None,
    *,
    pred_probs=None,
    thresholds=None,
    noise_matrix=None,
    inverse_noise_matrix=None,
    label_issues=None,
    sample_weight=None,
    clf_kwargs={},
    clf_final_kwargs={},
    validation_func=None,
    y=None,
) -> "CleanLearning"

Import

from cleanlab.classification import CleanLearning
# fit is a method of a CleanLearning instance

I/O Contract

Inputs

Name | Type | Required | Description
X | array-like (N, M) | Yes | Feature matrix for training data.
labels | np.ndarray (N,) | Yes | Array of given (potentially noisy) integer class labels. Can also be passed as y.
pred_probs | Optional[np.ndarray] (N, K) | No | Pre-computed out-of-sample predicted probabilities. Skips internal cross-validation if provided.
thresholds | Optional[np.ndarray] (K,) | No | Per-class thresholds for confident learning.
noise_matrix | Optional[np.ndarray] (K, K) | No | Noise matrix: conditional probability of each given label for each true label.
inverse_noise_matrix | Optional[np.ndarray] (K, K) | No | Inverse noise matrix: conditional probability of each true label for each given label.
label_issues | Optional[pd.DataFrame or np.ndarray] | No | Pre-computed label issues. DataFrame from find_label_issues(), boolean array, or integer index array.
sample_weight | Optional[np.ndarray] (N,) | No | Per-sample weights. Mislabeled examples are either removed or assigned zero weight.
clf_kwargs | dict | No | Keyword arguments passed to the classifier's fit() during the cross-validation phase.
clf_final_kwargs | dict | No | Keyword arguments passed to the classifier's fit() during the final training on cleaned data.
validation_func | Optional[callable] | No | Validation function called after cross-validation.
y | Optional[np.ndarray] | No | Alias for labels for sklearn pipeline compatibility.
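The shapes listed in the table share two sizes: N (number of examples) and K (number of classes). A minimal consistency check is sketched below. `check_fit_inputs` is a hypothetical helper written for illustration, not part of cleanlab, and inferring K from the largest label when pred_probs is absent is an assumption.

```python
import numpy as np

def check_fit_inputs(X, labels, pred_probs=None, thresholds=None,
                     noise_matrix=None, sample_weight=None):
    """Hypothetical pre-flight check; returns (N, K) or raises ValueError."""
    labels = np.asarray(labels)
    n = len(labels)
    if len(X) != n:
        raise ValueError("X and labels must have the same number of rows.")

    k = None
    if pred_probs is not None:
        pred_probs = np.asarray(pred_probs)
        if pred_probs.ndim != 2 or pred_probs.shape[0] != n:
            raise ValueError("pred_probs must have shape (N, K).")
        k = pred_probs.shape[1]
    if k is None:
        # Assumption: labels are integers in 0..K-1 and every class appears.
        k = int(labels.max()) + 1

    if thresholds is not None and np.shape(thresholds) != (k,):
        raise ValueError("thresholds must have shape (K,).")
    if noise_matrix is not None and np.shape(noise_matrix) != (k, k):
        raise ValueError("noise_matrix must have shape (K, K).")
    if sample_weight is not None and np.shape(sample_weight) != (n,):
        raise ValueError("sample_weight must have shape (N,).")
    return n, k

print(check_fit_inputs(np.zeros((4, 3)), [0, 1, 1, 0],
                       pred_probs=np.full((4, 2), 0.5)))
```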

Outputs

Name | Type | Description
return value | CleanLearning | The fitted CleanLearning instance (self), enabling method chaining.

Stored attributes after fit:

Attribute | Type | Description
self.label_issues_df | pd.DataFrame | DataFrame with is_label_issue, label_quality, given_label, predicted_label columns.
self.noise_matrix | np.ndarray (K, K) | Noise matrix: conditional probability of each given label for each true label.
self.inverse_noise_matrix | np.ndarray (K, K) | Inverse noise matrix: conditional probability of each true label for each given label.
self.confident_joint | np.ndarray (K, K) | Estimated confident joint counting matrix.
self.pred_probs | np.ndarray (N, K) | Out-of-sample predicted probabilities from cross-validation.
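The three (K, K) matrices are related: normalizing a joint count matrix by its marginals yields the two conditional matrices. The sketch below assumes entry (i, j) of the joint counts examples with given label i and true label j, and that both conditional matrices have columns summing to 1. It is a plain normalization for illustration; cleanlab's own estimators apply additional calibration, and `matrices_from_joint` is a hypothetical name.

```python
import numpy as np

def matrices_from_joint(joint_counts):
    """Derive P(given | true) and P(true | given) from a (K, K) joint count matrix."""
    joint = np.asarray(joint_counts, dtype=float)
    joint = joint / joint.sum()      # joint[i, j] = P(given=i, true=j)
    p_true = joint.sum(axis=0)       # P(true=j), marginal over given labels
    p_given = joint.sum(axis=1)      # P(given=i), marginal over true labels

    # noise_matrix[i, j] = P(given=i | true=j): divide column j by P(true=j).
    noise_matrix = joint / p_true
    # inverse_noise_matrix[j, i] = P(true=j | given=i): divide row i by P(given=i), then transpose.
    inverse_noise_matrix = (joint / p_given[:, None]).T
    return noise_matrix, inverse_noise_matrix

# Toy joint: 100 examples, 10 with given label 0 whose true label is 1,
# and 5 with given label 1 whose true label is 0.
noise_matrix, inverse_noise_matrix = matrices_from_joint([[40, 10], [5, 45]])
print(noise_matrix)
print(inverse_noise_matrix)
```

Off-diagonal entries of either matrix indicate how often one class is confused for another, which is what makes these attributes useful for post-hoc analysis.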

Usage Examples

Standard Training

from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)

# Model is trained on automatically cleaned data
predictions = cl.predict(X_test)

Training with Pre-computed Label Issues

from cleanlab.classification import CleanLearning

cl = CleanLearning()  # clf omitted: cleanlab falls back to scikit-learn's LogisticRegression

# Step 1: Find label issues first
label_issues_df = cl.find_label_issues(X_train, labels=y_train)

# Step 2: Review and possibly modify the label issues
# (e.g., manually verify some flagged examples)

# Step 3: Fit using the pre-computed label issues
cl.fit(X_train, labels=y_train, label_issues=label_issues_df)
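The review in Step 2 can be as simple as clearing the flag for examples a human has confirmed to be correctly labeled. The sketch below works on a boolean mask, which is one of the accepted forms of label_issues. `unflag_verified` and the toy values are hypothetical and stand in for whatever review process is actually used.

```python
import numpy as np

def unflag_verified(issue_mask, verified_ok_indices):
    """Return a copy of the mask with manually verified examples no longer flagged."""
    mask = np.array(issue_mask, dtype=bool, copy=True)
    mask[np.asarray(verified_ok_indices, dtype=int)] = False
    return mask

# Toy mask standing in for label_issues_df["is_label_issue"].to_numpy()
detected = np.array([True, False, True, True])

# Suppose a reviewer confirmed that example 2 is labeled correctly.
reviewed = unflag_verified(detected, [2])
print(reviewed)

# The reviewed mask could then be passed on, e.g.
# cl.fit(X_train, labels=y_train, label_issues=reviewed)
```

Working on a copy leaves the originally detected mask available for comparison against the reviewed one.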

Training with Different CV and Final Hyperparameters

from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier

# `weights` is assumed to be a per-sample weight array defined by the caller (not shown).
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(
    X_train,
    labels=y_train,
    clf_kwargs={"sample_weight": None},       # CV phase: no sample weights
    clf_final_kwargs={"sample_weight": weights},  # Final phase: use sample weights
)
