Implementation: Cleanlab CleanLearning.fit
| Field | Value |
|---|---|
| Sources | Confident Learning, Cleanlab |
| Domains | Machine_Learning, Data_Quality |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
CleanLearning.fit trains a classifier on data that has been automatically cleaned of detected label issues, producing a model robust to label noise.
Description
The fit method implements the full noise-robust training pipeline. It detects mislabeled examples (or accepts pre-computed label issues), removes them from the training set, and retrains the wrapped classifier on the cleaned data. The method returns self to support sklearn's method chaining convention.
The training pipeline proceeds through the following stages:
- Label issue detection or acceptance: If `label_issues` is not provided, the method calls `self.find_label_issues()` internally to detect mislabeled examples. If `label_issues` is provided as a DataFrame (from a previous call to `find_label_issues()`) or a boolean/integer array, it is used directly.
- Data pruning: Examples identified as label issues are removed from `X` and `labels`. The indices of removed examples are preserved in `self.label_issues_df`. If the classifier supports `sample_weight`, mislabeled examples can alternatively be assigned zero weight.
- Sample weight adjustment: If `sample_weight` is provided, the weights for removed examples are also pruned to maintain alignment between features, labels, and weights.
- Final classifier training: The wrapped classifier is fit on the pruned dataset using `clf_final_kwargs` (which may differ from `clf_kwargs` used during the cross-validation phase). This allows using different hyperparameters for the final training pass.
- State storage: The method stores `self.label_issues_df`, `self.noise_matrix`, `self.inverse_noise_matrix`, `self.confident_joint`, and `self.pred_probs` for post-hoc analysis.
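The pruning and final-training stages above can be sketched in plain NumPy. This is an illustrative reimplementation of the idea, not the library's internal code; `prune_and_fit` is a hypothetical helper name.

```python
import numpy as np

def prune_and_fit(clf, X, labels, is_label_issue, sample_weight=None, **clf_final_kwargs):
    """Illustrative sketch (not cleanlab internals): drop rows flagged as
    label issues, keep sample_weight aligned with the pruned data, then
    refit the wrapped classifier on the cleaned subset."""
    keep = ~np.asarray(is_label_issue, dtype=bool)   # mask of clean examples
    X_clean = X[keep]                                # data pruning
    y_clean = labels[keep]
    if sample_weight is not None:
        # weights for removed examples are pruned too, so features,
        # labels, and weights stay aligned
        clf_final_kwargs["sample_weight"] = sample_weight[keep]
    clf.fit(X_clean, y_clean, **clf_final_kwargs)    # final training pass
    return clf
```

The real method additionally records the removed indices in `self.label_issues_df` so the pruning decision remains auditable after training.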
The `y` parameter is accepted as an alias for `labels` to support sklearn pipeline compatibility.
Usage
Call fit on a CleanLearning instance just as you would call fit on any sklearn classifier.
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)
# Inspect detected issues
print(f"Found {cl.label_issues_df['is_label_issue'].sum()} label issues")
print(f"Trained on {(~cl.label_issues_df['is_label_issue']).sum()} clean examples")
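The snippet above assumes `X_train` and `y_train` already exist. For experimentation, a noisy-label dataset of the kind `fit` is designed to handle can be synthesized with sklearn alone (illustrative; the 10% flip rate is arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification

# Synthetic 3-class dataset with ~10% of labels randomly overwritten,
# simulating annotation noise.
rng = np.random.default_rng(0)
X_train, y_true = make_classification(
    n_samples=500, n_classes=3, n_informative=5, random_state=0
)
y_train = y_true.copy()
flip = rng.random(len(y_train)) < 0.10            # pick ~10% of examples
y_train[flip] = rng.integers(0, 3, flip.sum())    # replace with random labels
```

Passing this `(X_train, y_train)` pair to `cl.fit` lets you check how many of the injected flips are flagged in `cl.label_issues_df`.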
Code Reference
Source Location
- Repository: `cleanlab/cleanlab`
- File: `cleanlab/classification.py`
- Lines: 265–582
Signature
def fit(
self,
X,
labels=None,
*,
pred_probs=None,
thresholds=None,
noise_matrix=None,
inverse_noise_matrix=None,
label_issues=None,
sample_weight=None,
clf_kwargs={},
clf_final_kwargs={},
validation_func=None,
y=None,
) -> "CleanLearning"
Import
from cleanlab.classification import CleanLearning
# fit is a method of a CleanLearning instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| `X` | array-like (N, M) | Yes | Feature matrix for training data. |
| `labels` | np.ndarray (N,) | Yes | Array of given (potentially noisy) integer class labels. Can also be passed as `y`. |
| `pred_probs` | Optional[np.ndarray] (N, K) | No | Pre-computed out-of-sample predicted probabilities. Skips internal cross-validation if provided. |
| `thresholds` | Optional[np.ndarray] (K,) | No | Per-class thresholds for confident learning. |
| `noise_matrix` | Optional[np.ndarray] (K, K) | No | Conditional probability matrix P(given label \| true label). |
| `inverse_noise_matrix` | Optional[np.ndarray] (K, K) | No | Conditional probability matrix P(true label \| given label). |
| `label_issues` | Optional[pd.DataFrame or np.ndarray] | No | Pre-computed label issues: DataFrame from `find_label_issues()`, boolean mask, or integer index array. |
| `sample_weight` | Optional[np.ndarray] (N,) | No | Per-sample weights. Mislabeled examples are either removed or assigned zero weight. |
| `clf_kwargs` | dict | No | Keyword arguments passed to the classifier's `fit()` during the cross-validation phase. |
| `clf_final_kwargs` | dict | No | Keyword arguments passed to the classifier's `fit()` during the final training on cleaned data. |
| `validation_func` | Optional[callable] | No | Validation function called after cross-validation. |
| `y` | Optional[np.ndarray] | No | Alias for `labels` for sklearn pipeline compatibility. |
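When the wrapped classifier is expensive to cross-validate inside `fit`, the `pred_probs` input can be computed once up front. A sketch using sklearn's `cross_val_predict` (the data here is random placeholder data; any `(X, labels)` pair works):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Placeholder data: 200 examples, 4 features, 3 classes.
X = np.random.default_rng(0).normal(size=(200, 4))
labels = np.random.default_rng(1).integers(0, 3, 200)

# Out-of-sample class probabilities of shape (N, K): each row is predicted
# by a model that never saw that example during training.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
)
# These can then be reused: cl.fit(X, labels, pred_probs=pred_probs)
```

The key requirement is that the probabilities be out-of-sample; in-sample probabilities from a model fit on all the data would make mislabeled examples look better-labeled than they are.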
Outputs
| Name | Type | Description |
|---|---|---|
| return value | CleanLearning | The fitted CleanLearning instance (`self`), enabling method chaining. |
Stored attributes after fit:
| Attribute | Type | Description |
|---|---|---|
| `self.label_issues_df` | pd.DataFrame | DataFrame with `is_label_issue`, `label_quality`, `given_label`, and `predicted_label` columns. |
| `self.noise_matrix` | np.ndarray (K, K) | Estimated conditional probability matrix P(given label \| true label). |
| `self.inverse_noise_matrix` | np.ndarray (K, K) | Estimated conditional probability matrix P(true label \| given label). |
| `self.confident_joint` | np.ndarray (K, K) | Estimated confident joint counting matrix. |
| `self.pred_probs` | np.ndarray (N, K) | Out-of-sample predicted probabilities from cross-validation. |
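The stored matrices are related: in confident learning, the noise matrix P(given label \| true label) is obtained by normalizing each true-label column of the confident joint to sum to 1. A simplified sketch of that relationship (the counts are made up, and cleanlab applies an additional calibration step this omits):

```python
import numpy as np

# Hypothetical confident joint: entry [i, j] counts examples with
# given label i whose true label is estimated to be j.
confident_joint = np.array([[80,  5,  2],
                            [ 6, 70,  4],
                            [ 3,  8, 90]])

# Column-normalize so each true-label column is a probability
# distribution over given labels: P(given label | true label).
noise_matrix = confident_joint / confident_joint.sum(axis=0, keepdims=True)
```

Large off-diagonal entries in `noise_matrix` indicate systematic confusion between a pair of classes, which is often more actionable than the per-example flags alone.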
Usage Examples
Standard Training
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(X_train, labels=y_train)
# Model is trained on automatically cleaned data
predictions = cl.predict(X_test)
Training with Pre-computed Label Issues
from cleanlab.classification import CleanLearning
cl = CleanLearning()
# Step 1: Find label issues first
label_issues_df = cl.find_label_issues(X_train, labels=y_train)
# Step 2: Review and possibly modify the label issues
# (e.g., manually verify some flagged examples)
# Step 3: Fit using the pre-computed label issues
cl.fit(X_train, labels=y_train, label_issues=label_issues_df)
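Step 2 can also produce a plain boolean mask instead of the DataFrame. A hypothetical review step that un-flags examples a human verified as correctly labeled (the four-row DataFrame and `verified_ok` indices are illustrative):

```python
import numpy as np
import pandas as pd

# Stand-in for the DataFrame returned by find_label_issues().
label_issues_df = pd.DataFrame({"is_label_issue": [True, False, True, False]})

verified_ok = [0]  # indices a reviewer confirmed are correctly labeled
mask = label_issues_df["is_label_issue"].to_numpy().copy()
mask[verified_ok] = False  # keep these examples in the training set

# The edited mask can be passed directly: cl.fit(..., label_issues=mask)
```

Because `fit` accepts boolean arrays, the review loop does not need to round-trip through the DataFrame format.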
Training with Different CV and Final Hyperparameters
from cleanlab.classification import CleanLearning
from sklearn.ensemble import GradientBoostingClassifier
cl = CleanLearning(clf=GradientBoostingClassifier())
cl.fit(
    X_train,
    labels=y_train,
    clf_kwargs={"sample_weight": None},           # CV phase: no sample weights
    clf_final_kwargs={"sample_weight": weights},  # Final phase: `weights` is an (N,) array defined elsewhere
)