Implementation:Cleanlab Cleanlab Datalab Find Issues
| Field | Value |
|---|---|
| Sources | Cleanlab |
| Domains | Data_Quality, Dataset_Auditing, Machine_Learning |
| Last Updated | 2026-02-09 12:00 GMT |
Overview
Datalab_Find_Issues is the core audit method: it orchestrates multiple specialized issue detectors to find diverse quality problems in a dataset with a single API call.
Description
The Datalab.find_issues method accepts optional model outputs (predicted probabilities, feature embeddings, or a precomputed nearest-neighbor graph) and runs a battery of specialized issue managers against the dataset. An IssueFinder orchestrator determines which managers can run given the available inputs and executes them in sequence. Each manager computes per-example boolean flags and numeric severity scores, all of which are stored in the Datalab instance's internal DataIssues container. The method mutates the Datalab instance in place and does not return a value.
If an empty issue_types dictionary is passed, a warning is emitted and no issues are detected. If issue_types is None, the default set of issue types for the configured task is used.
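The selection logic described above can be illustrated with a small, self-contained sketch. This is not cleanlab's internal code: the `REQUIRED_INPUTS` table and the `select_issue_types` helper are hypothetical names introduced here, and the input requirements simply mirror the parameter descriptions on this page.

```python
import warnings
from typing import Any, Dict, List, Optional

# Hypothetical table: issue type -> inputs that enable it (any one suffices).
# Illustrative assumption only, not cleanlab's internal registry.
REQUIRED_INPUTS: Dict[str, List[str]] = {
    "label": ["pred_probs"],
    "outlier": ["features", "knn_graph"],
    "near_duplicate": ["features", "knn_graph"],
    "non_iid": ["features", "knn_graph"],
}


def select_issue_types(
    available_inputs: Dict[str, Any],
    issue_types: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return the issue types (with their kwargs) that can run on the given inputs."""
    if issue_types is None:
        # None means "use the defaults": consider every known issue type.
        requested = {name: {} for name in REQUIRED_INPUTS}
    elif not issue_types:
        # An empty dict is an explicit request to check nothing.
        warnings.warn("No issue types were specified; no issues will be detected.")
        return {}
    else:
        requested = dict(issue_types)

    # Inputs count as available only if they were actually passed (not None).
    provided = {name for name, value in available_inputs.items() if value is not None}
    return {
        name: kwargs
        for name, kwargs in requested.items()
        if provided & set(REQUIRED_INPUTS.get(name, []))
    }
```

In this sketch, passing only `pred_probs` selects just the `"label"` check, while passing only `features` selects the three neighbor-based checks.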
Usage
Call this method on an initialized Datalab instance after training your model and obtaining out-of-sample predicted probabilities. Provide as many inputs as you have available to maximize the types of issues that can be detected.
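One common way to obtain out-of-sample predicted probabilities is cross-validation. The sketch below uses scikit-learn's `cross_val_predict` on synthetic data; the toy dataset, the choice of classifier, and the fold count are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data (assumed for illustration): two separated clusters, 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 2)),
    rng.normal(3.0, 1.0, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

# Each row is predicted by a model that did not see that example in training.
pred_probs = cross_val_predict(
    LogisticRegression(), X, y, cv=5, method="predict_proba"
)
```

The resulting `pred_probs` array has shape (N, K) and can be passed as `lab.find_issues(pred_probs=pred_probs)`.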
Code Reference
Source Location
- Repository: cleanlab/cleanlab
- File: cleanlab/datalab/datalab.py
- Lines: 151--158
Signature
def find_issues(
    self,
    *,
    pred_probs: Optional[np.ndarray] = None,
    features: Optional[npt.NDArray] = None,
    knn_graph: Optional[csr_matrix] = None,
    issue_types: Optional[Dict[str, Any]] = None,
) -> None
Import
from cleanlab import Datalab
# find_issues is a method of the Datalab instance
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pred_probs | Optional[np.ndarray] | No | Out-of-sample predicted class probabilities with shape (N, K) for classification, (N,) for regression, or (N, K) for multilabel. Columns must be ordered by lexicographically sorted class names. Enables label issue detection. |
| features | Optional[np.ndarray] | No | Feature embeddings with shape (N, D). Enables outlier, duplicate, and non-IID issue detection. Used to construct a knn_graph if one is not provided. |
| knn_graph | Optional[scipy.sparse.csr_matrix] | No | Precomputed k-nearest-neighbor distance graph as a sparse CSR matrix with shape (N, N). Non-zero entries represent distances. Takes precedence over features if both are provided. |
| issue_types | Optional[Dict[str, Any]] | No | Dictionary specifying which issue types to check and their configuration. Keys are issue type names, values are dicts of keyword arguments for the corresponding IssueManager. If None, defaults are used. |
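The column-ordering requirement for pred_probs can be met by reordering columns before calling find_issues. The following is a minimal sketch with NumPy; the class names and probability values are made up for illustration, and `model_classes` stands in for whatever class order your model reports.

```python
import numpy as np

# Hypothetical model output: columns follow the model's own class order.
model_classes = ["dog", "cat", "bird"]
pred_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
])

# Indices that put the class names in lexicographic order.
order = np.argsort(model_classes)
sorted_classes = [model_classes[i] for i in order]

# Column j now corresponds to the j-th class name in sorted order.
pred_probs_sorted = pred_probs[:, order]
```

scikit-learn classifiers generally expose their classes in sorted order through `classes_`, so this step tends to matter mainly for models that define their own class order.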
Outputs
| Name | Type | Description |
|---|---|---|
| return | None | The method mutates the Datalab instance in-place, populating self.data_issues with per-example issue flags, scores, and summary statistics. Access results via get_issues(), get_issue_summary(), or report(). |
Usage Examples
With Predicted Probabilities
from sklearn.linear_model import LogisticRegression
import numpy as np
from cleanlab import Datalab
# Toy dataset: four 2-D points with binary labels
X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])
# In-sample probabilities keep this toy example short;
# prefer out-of-sample predictions on real data.
clf = LogisticRegression(random_state=0).fit(X, y)
pred_probs = clf.predict_proba(X)
lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(pred_probs=pred_probs)
With Feature Embeddings
import numpy as np
from cleanlab import Datalab
X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])
lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(features=X)
With a Precomputed KNN Graph
from sklearn.neighbors import NearestNeighbors
import numpy as np
from cleanlab import Datalab
X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])
nbrs = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
knn_graph = nbrs.kneighbors_graph(mode="distance")
lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(knn_graph=knn_graph)
Selecting Specific Issue Types
from cleanlab import Datalab
# Reuses `lab` and `pred_probs` from the "With Predicted Probabilities" example
# Only check for label issues with custom configuration
issue_types = {
    "label": {
        "clean_learning_kwargs": {
            "prune_method": "prune_by_noise_rate",
        },
    },
}
lab.find_issues(pred_probs=pred_probs, issue_types=issue_types)