Implementation: Cleanlab CleanLearning.fit
| Field | Value |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
CleanLearning.fit trains a classifier on data that has been automatically cleaned of detected label issues, producing a model robust to label noise.
Description
The fit method implements the full noise-robust training pipeline. It detects mislabeled examples (or accepts pre-computed label issues), removes them from the training set, and retrains the wrapped classifier on the cleaned data. The method returns self to support sklearn's method chaining convention.
The training pipeline proceeds through the following stages:
- Label issue detection or acceptance: If `label_issues` is not provided, the method calls `self.find_label_issues()` internally to detect mislabeled examples. If `label_issues` is provided as a DataFrame (from a previous call to `find_label_issues()`) or a boolean/integer array, it is used directly.
- Data pruning: Examples identified as label issues are removed from `X` and `labels`. The indices of removed examples are preserved in `self.label_issues_df`. If the classifier supports `sample_weight`, mislabeled examples can alternatively be assigned zero weight.
- Sample weight adjustment: If `sample_weight` is provided, the weights for removed examples are also pruned to maintain alignment between features, labels, and weights.
- Final classifier training: The wrapped classifier is fit on the pruned dataset using `clf_final_kwargs` (which may differ from `clf_kwargs` used during the cross-validation phase). This allows using different hyperparameters for the final training pass.
- State storage: The method stores `self.label_issues_df`, `self.noise_matrix`, `self.inverse_noise_matrix`, `self.confident_joint`, and `self.pred_probs` for post-hoc analysis.
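The pruning and final-training stages above can be sketched in plain NumPy. This is an illustrative reimplementation of the idea, not the library's internal code; `prune_and_fit` is a hypothetical helper name.

```python
import numpy as np

def prune_and_fit(clf, X, labels, is_label_issue, sample_weight=None, **clf_final_kwargs):
    """Illustrative sketch (not cleanlab internals): drop rows flagged as
    label issues, keep sample_weight aligned with the pruned data, then
    refit the wrapped classifier on the cleaned subset."""
    keep = ~np.asarray(is_label_issue, dtype=bool)   # mask of clean examples
    X_clean = X[keep]                                # data pruning
    y_clean = labels[keep]
    if sample_weight is not None:
        # weights for removed examples are pruned too, so features,
        # labels, and weights stay aligned
        clf_final_kwargs["sample_weight"] = sample_weight[keep]
    clf.fit(X_clean, y_clean, **clf_final_kwargs)    # final training pass
    return clf
```

The real method additionally records the removed indices in `self.label_issues_df` so the pruning decision remains auditable after training.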
The `y` parameter is accepted as an alias for `labels` to support sklearn pipeline compatibility.
Usage
Call fit on a CleanLearning instance just as you would call fit on any sklearn classifier.
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)
# Inspect detected issues
print(f"Found {cl.label_issues_df['is_label_issue'].sum()} label issues")
print(f"Trained on {(~cl.label_issues_df['is_label_issue']).sum()} clean examples")
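The snippet above assumes `X_train` and `y_train` already exist. For experimentation, a noisy-label dataset of the kind `fit` is designed to handle can be synthesized with sklearn alone (illustrative; the 10% flip rate is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification

# Synthetic 3-class dataset with ~10% of labels randomly overwritten,
# simulating annotation noise.
rng = np.random.default_rng(0)
X_train, y_true = make_classification(
    n_samples=500, n_classes=3, n_informative=5, random_state=0
)
y_train = y_true.copy()
flip = rng.random(len(y_train)) < 0.10            # pick ~10% of examples
y_train[flip] = rng.integers(0, 3, flip.sum())    # replace with random labels
```

Passing this `(X_train, y_train)` pair to `cl.fit` lets you check how many of the injected flips are flagged in `cl.label_issues_df`.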
Code Reference
Source Location
- Repository: `cleanlab/cleanlab`
- File: `cleanlab/classification.py`
- Lines: 265–582
Signature
def fit(
self,
X,
labels=None,
*,
pred_probs=None,
thresholds=None,
noise_matrix=None,
inverse_noise_matrix=None,
label_issues=None,
sample_weight=None,
clf_kwargs={},
clf_final_kwargs={},
validation_func=None,
y=None,
) -> "CleanLearning"
Import
from cleanlab.classification import CleanLearning
# fit is a method of a CleanLearning instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `X` | array-like (N, M) | Yes | Feature matrix for training data. |
| `labels` | np.ndarray (N,) | Yes | Array of given (potentially noisy) integer class labels. Can also be passed as `y`. |
| `pred_probs` | Optional[np.ndarray] (N, K) | No | Pre-computed out-of-sample predicted probabilities. Skips internal cross-validation if provided. |
| `thresholds` | Optional[np.ndarray] (K,) | No | Per-class thresholds for confident learning. |
| `noise_matrix` | Optional[np.ndarray] (K, K) | No | Conditional probability matrix P(given label \| true label). |
| `inverse_noise_matrix` | Optional[np.ndarray] (K, K) | No | Conditional probability matrix P(true label \| given label). |
| `label_issues` | Optional[pd.DataFrame or np.ndarray] | No | Pre-computed label issues: DataFrame from `find_label_issues()`, boolean mask, or integer index array. |
| `sample_weight` | Optional[np.ndarray] (N,) | No | Per-sample weights. Mislabeled examples are either removed or assigned zero weight. |
| `clf_kwargs` | dict | No | Keyword arguments passed to the classifier's `fit()` during the cross-validation phase. |
| `clf_final_kwargs` | dict | No | Keyword arguments passed to the classifier's `fit()` during the final training on cleaned data. |
| `validation_func` | Optional[callable] | No | Validation function called after cross-validation. |
| `y` | Optional[np.ndarray] | No | Alias for `labels` for sklearn pipeline compatibility. |
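When the wrapped classifier is expensive to cross-validate inside `fit`, the `pred_probs` input can be computed once up front. A sketch using sklearn's `cross_val_predict` (the data here is random placeholder data; any `(X, labels)` pair works):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Placeholder data: 200 examples, 4 features, 3 classes.
X = np.random.default_rng(0).normal(size=(200, 4))
labels = np.random.default_rng(1).integers(0, 3, 200)

# Out-of-sample class probabilities of shape (N, K): each row is predicted
# by a model that never saw that example during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)
# These can then be reused: cl.fit(X, labels, pred_probs=pred_probs)
```

The key requirement is that the probabilities be out-of-sample; in-sample probabilities from a model fit on all the data would make mislabeled examples look better-labeled than they are.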
Outputs
| Name | Type | Description |
|---|---|---|
| return value | CleanLearning | The fitted CleanLearning instance (`self`), enabling method chaining. |
Stored attributes after fit:
| Attribute | Type | Description |
|---|---|---|
| `self.label_issues_df` | pd.DataFrame | DataFrame with `is_label_issue`, `label_quality`, `given_label`, and `predicted_label` columns. |
| `self.noise_matrix` | np.ndarray (K, K) | Estimated conditional probability matrix P(given label \| true label). |
| `self.inverse_noise_matrix` | np.ndarray (K, K) | Estimated conditional probability matrix P(true label \| given label). |
| `self.confident_joint` | np.ndarray (K, K) | Estimated confident joint counting matrix. |
| `self.pred_probs` | np.ndarray (N, K) | Out-of-sample predicted probabilities from cross-validation. |
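The stored matrices are related: in confident learning, the noise matrix P(given label \| true label) is obtained by normalizing each true-label column of the confident joint to sum to 1. A simplified sketch of that relationship (the counts are made up, and cleanlab applies an additional calibration step this omits):

```python
import numpy as np

# Hypothetical confident joint: entry [i, j] counts examples with
# given label i whose true label is estimated to be j.
confident_joint = np.array([[80,  5,  2],
                            [ 6, 70,  4],
                            [ 3,  8, 90]])

# Column-normalize so each true-label column is a probability
# distribution over given labels: P(given label | true label).
noise_matrix = confident_joint / confident_joint.sum(axis=0, keepdims=True)
```

Large off-diagonal entries in `noise_matrix` indicate systematic confusion between a pair of classes, which is often more actionable than the per-example flags alone.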
Usage Examples
Standard Training
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)
# Model is trained on automatically cleaned data
predictions = cl.predict(X_test)
Training with Pre-computed Label Issues
from cleanlab.classification import CleanLearning
cl = CleanLearning()
# Step 1: Find label issues first
label_issues_df = cl.find_label_issues(X_train, labels=y_train)
# Step 2: Review and possibly modify the label issues
# (e.g., manually verify some flagged examples)
# Step 3: Fit using the pre-computed label issues
cl.fit(X_train, labels=y_train, label_issues=label_issues_df)
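Step 2 can also produce a plain boolean mask instead of the DataFrame. A hypothetical review step that un-flags examples a human verified as correctly labeled (the four-row DataFrame and `verified_ok` indices are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by find_label_issues().
label_issues_df = pd.DataFrame({"is_label_issue": [True, False, True, False]})

verified_ok = [0]  # indices a reviewer confirmed are correctly labeled
mask = label_issues_df["is_label_issue"].to_numpy().copy()
mask[verified_ok] = False  # keep these examples in the training set

# The edited mask can be passed directly: cl.fit(..., label_issues=mask)
```

Because `fit` accepts boolean arrays, the review loop does not need to round-trip through the DataFrame format.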
Training with Different CV and Final Hyperparameters
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(
    X_train,
    labels=y_train,
    clf_kwargs={"sample_weight": None},           # CV phase: no sample weights
    clf_final_kwargs={"sample_weight": weights},  # Final phase: `weights` is an (N,) array defined elsewhere
)