Implementation:Cleanlab Cleanlab Datalab Find Issues

Field	Value
Sources	Cleanlab
Domains	Data_Quality, Dataset_Auditing, Machine_Learning
Last Updated	2026-02-09 12:00 GMT

Overview

Datalab_Find_Issues is the core audit method that orchestrates multiple specialized issue detectors to find diverse quality problems in a dataset from a single API call.

Description

The Datalab.find_issues method accepts optional model outputs (predicted probabilities, feature embeddings, or a precomputed nearest-neighbor graph) and runs a battery of specialized issue managers against the dataset. An IssueFinder orchestrator determines which managers can run given the available inputs and executes them in sequence. Each manager computes per-example boolean flags and numeric severity scores, all of which are stored in the Datalab instance's internal DataIssues container. The method mutates the Datalab instance in-place and does not return a value.
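
The dispatch step described above can be pictured with a toy sketch. This is not cleanlab's IssueFinder: the `REQUIRED_INPUTS` table and the `runnable_issue_types` helper are hypothetical names, and the mapping only mirrors the input descriptions on this page. The library's real requirements are likely richer.

```python
from typing import Dict, List, Optional, Set

# Hypothetical table: for each issue type, the alternative sets of inputs
# that would be enough to run it (taken from this page's I/O contract only).
REQUIRED_INPUTS: Dict[str, List[Set[str]]] = {
    "label": [{"pred_probs"}],
    "outlier": [{"features"}, {"knn_graph"}],
    "near_duplicate": [{"features"}, {"knn_graph"}],
    "non_iid": [{"features"}, {"knn_graph"}],
}


def runnable_issue_types(**inputs: Optional[object]) -> List[str]:
    """Return the issue types whose input requirements are satisfied."""
    available = {name for name, value in inputs.items() if value is not None}
    return [
        issue_type
        for issue_type, alternatives in REQUIRED_INPUTS.items()
        if any(needed <= available for needed in alternatives)
    ]


only_probs = runnable_issue_types(pred_probs=[[0.9, 0.1]], features=None, knn_graph=None)
with_features = runnable_issue_types(pred_probs=None, features=[[0.0, 1.0]], knn_graph=None)
print(only_probs)
print(with_features)
```

The point of the sketch is the design choice: checks are selected from whatever inputs are present, so supplying more inputs widens the audit without changing the call.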

If an empty issue_types dictionary is passed, a warning is emitted and no issues are detected. If issue_types is None, the default set of issue types for the configured task is used.

Usage

Call this method on an initialized Datalab instance, typically after training a model and obtaining out-of-sample predicted probabilities. All inputs are optional; provide as many as you have available, since each one enables additional issue types.
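
One common way to obtain out-of-sample predicted probabilities is K-fold cross-validation. The sketch below uses scikit-learn's `cross_val_predict`; the synthetic two-cluster dataset, the choice of LogisticRegression, and `cv=5` are illustrative assumptions, not recommendations from this page.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic two-class data: two Gaussian clusters of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(20, 2)),
    rng.normal(3.0, 1.0, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

# Each row is predicted by a model that never saw that row during training.
pred_probs = cross_val_predict(
    LogisticRegression(), X, y, cv=5, method="predict_proba"
)
print(pred_probs.shape)  # (40, 2)
```

The resulting array can then be passed as `lab.find_issues(pred_probs=pred_probs)`.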

Code Reference

Source Location

Repository: cleanlab/cleanlab
File: cleanlab/datalab/datalab.py
Lines: 151-158

Signature

def find_issues(
    self,
    *,
    pred_probs: Optional[np.ndarray] = None,
    features: Optional[npt.NDArray] = None,
    knn_graph: Optional[csr_matrix] = None,
    issue_types: Optional[Dict[str, Any]] = None,
) -> None

Import

from cleanlab import Datalab
# find_issues is a method of the Datalab instance

I/O Contract

Inputs

Name	Type	Required	Description
`pred_probs`	`Optional[np.ndarray]`	No	Out-of-sample model predictions: class probabilities with shape `(N, K)` for classification and multilabel tasks, or predicted values with shape `(N,)` for regression. Columns must be ordered by lexicographically sorted class names. Enables label issue detection.
`features`	`Optional[npt.NDArray]`	No	Feature embeddings with shape `(N, D)`. Enables outlier, duplicate, and non-IID issue detection. Used to construct a knn_graph if one is not provided.
`knn_graph`	`Optional[scipy.sparse.csr_matrix]`	No	Precomputed k-nearest-neighbor distance graph as a sparse CSR matrix with shape `(N, N)`. Non-zero entries represent distances. Takes precedence over features if both are provided.
`issue_types`	`Optional[Dict[str, Any]]`	No	Dictionary specifying which issue types to check and their configuration. Keys are issue type names, values are dicts of keyword arguments for the corresponding IssueManager. If `None`, defaults are used.
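
The column-ordering requirement on `pred_probs` can be met by reordering a model's output columns. In this sketch, `class_names` and `raw_probs` are hypothetical stand-ins for a model whose columns are not already in sorted class order.

```python
import numpy as np

# Hypothetical model output: columns follow the model's own class order.
class_names = ["dog", "cat", "bird"]
raw_probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
])

# Reorder columns so they follow lexicographically sorted class names.
sorted_names = sorted(class_names)
col_idx = [class_names.index(name) for name in sorted_names]
pred_probs = raw_probs[:, col_idx]

print(sorted_names)
print(pred_probs)
```

scikit-learn classifiers store sorted labels in `classes_`, so their `predict_proba` output usually needs no reordering.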

Outputs

Name	Type	Description
return	`None`	The method mutates the Datalab instance in-place, populating `self.data_issues` with per-example issue flags, scores, and summary statistics. Access results via `get_issues()`, `get_issue_summary()`, or `report()`.

Usage Examples

With Predicted Probabilities

from sklearn.linear_model import LogisticRegression
import numpy as np
from cleanlab import Datalab

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])
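clf = LogisticRegression(random_state=0).fit(X, y)
# For brevity these probabilities are in-sample; a real audit should use
# out-of-sample predictions (for example, from cross-validation).
pred_probs = clf.predict_proba(X)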

lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(pred_probs=pred_probs)

With Feature Embeddings

import numpy as np
from cleanlab import Datalab

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])

lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(features=X)

With a Precomputed KNN Graph

from sklearn.neighbors import NearestNeighbors
import numpy as np
from cleanlab import Datalab

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])

nbrs = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
knn_graph = nbrs.kneighbors_graph(mode="distance")

lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(knn_graph=knn_graph)
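
Before passing a precomputed graph, it can help to confirm that it matches the format described in the I/O contract. The checks below are a sketch of such a sanity test; whether Datalab performs the same validation internally is not stated on this page.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.neighbors import NearestNeighbors

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
k = 2

nbrs = NearestNeighbors(n_neighbors=k, metric="euclidean").fit(X)
# Calling without a query excludes each point from its own neighbor list.
knn_graph = nbrs.kneighbors_graph(mode="distance")

n = X.shape[0]
is_csr = issparse(knn_graph) and knn_graph.format == "csr"
neighbors_per_row = np.diff(knn_graph.indptr)  # stored entries in each row

print(is_csr, knn_graph.shape)
print(neighbors_per_row)
```

Each row should hold `k` stored distances, and the diagonal should be empty because no point is listed as its own neighbor.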

Selecting Specific Issue Types

from cleanlab import Datalab

# Reuses `lab` and `pred_probs` from the "With Predicted Probabilities" example.
# Only check for label issues, with a custom configuration
issue_types = {
    "label": {
        "clean_learning_kwargs": {
            "prune_method": "prune_by_noise_rate",
        },
    },
}
lab.find_issues(pred_probs=pred_probs, issue_types=issue_types)
