Implementation:Cleanlab Cleanlab Datalab Find Issues


Field | Value
Sources | Cleanlab
Domains | Data_Quality, Dataset_Auditing, Machine_Learning
Last Updated | 2026-02-09 12:00 GMT

Overview

Datalab_Find_Issues is the core audit method that orchestrates multiple specialized issue detectors to find diverse quality problems in a dataset from a single API call.

Description

The Datalab.find_issues method accepts optional model outputs (predicted probabilities, feature embeddings, precomputed nearest neighbor graph) and runs a battery of specialized issue managers against the dataset. An IssueFinder orchestrator determines which managers can run given the available inputs and executes them in sequence. Each manager computes per-example boolean flags and numeric severity scores, which are all stored in the Datalab instance's internal DataIssues container. The method mutates the Datalab instance in-place and does not return a value.

If an empty issue_types dictionary is passed, a warning is emitted and no issues are detected. If issue_types is None, the default set of issue types for the configured task is used.
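The selection logic described above can be pictured with a small, self-contained sketch. This is not cleanlab's IssueFinder; the REQUIRED_INPUTS mapping and the select_issue_types helper are hypothetical names invented here, and the mapping is deliberately simplified (for instance, it ignores that a knn_graph can stand in for features). It only illustrates the three cases: issue_types is None, issue_types is empty, and issue_types names specific checks.

```python
import warnings
from typing import Any, Dict, List, Optional

# Hypothetical, simplified mapping from issue type to the inputs it needs.
REQUIRED_INPUTS: Dict[str, List[str]] = {
    "label": ["pred_probs"],
    "outlier": ["features"],
    "near_duplicate": ["features"],
}


def select_issue_types(
    available: Dict[str, Any],
    issue_types: Optional[Dict[str, Dict[str, Any]]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return the issue types that can run with the inputs that were supplied."""
    # An explicitly empty dict means "check nothing": warn and stop early.
    if issue_types is not None and not issue_types:
        warnings.warn("No issue types were specified, so nothing will be checked.")
        return {}

    # None means "use the defaults", here every known issue type with no kwargs.
    requested = issue_types if issue_types is not None else {
        name: {} for name in REQUIRED_INPUTS
    }

    # Keep only the checks whose required inputs were actually provided.
    supplied = {name for name, value in available.items() if value is not None}
    return {
        name: kwargs
        for name, kwargs in requested.items()
        if name in REQUIRED_INPUTS and set(REQUIRED_INPUTS[name]) <= supplied
    }


# Only pred_probs is available, so only the label check is selected.
runnable = select_issue_types({"pred_probs": [[0.9, 0.1]], "features": None})
```

In this sketch, each selected entry would then be handed to its own manager, which computes the per-example flags and scores.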

Usage

Call this method on an initialized Datalab instance after training your model and obtaining out-of-sample predicted probabilities. Provide as many inputs as you have available to maximize the types of issues that can be detected.
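One common way to obtain out-of-sample predicted probabilities is K-fold cross-validation, so that every example is scored by a model that never saw it during training. The sketch below uses scikit-learn's cross_val_predict on a synthetic dataset; the dataset, the classifier, and the fold count are illustrative assumptions, not recommendations from the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class dataset, used here purely for illustration.
X, y = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Each row is predicted by a model trained on the other folds,
# which makes the probabilities out-of-sample for every example.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    method="predict_proba",
)

# With cleanlab installed, these probabilities could then be passed on, e.g.:
# lab = Datalab(data={"X": X, "y": y}, label_name="y")
# lab.find_issues(pred_probs=pred_probs, features=X)
```

The resulting array has one row per example and one column per class, which is the (N, K) layout expected for classification.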

Code Reference

Source Location

Repository: cleanlab/cleanlab
File: cleanlab/datalab/datalab.py
Lines: 151-158

Signature

def find_issues(
    self,
    *,
    pred_probs: Optional[np.ndarray] = None,
    features: Optional[npt.NDArray] = None,
    knn_graph: Optional[csr_matrix] = None,
    issue_types: Optional[Dict[str, Any]] = None,
) -> None

Import

from cleanlab import Datalab
# find_issues is a method of the Datalab instance

I/O Contract

Inputs

Name | Type | Required | Description
pred_probs | Optional[np.ndarray] | No | Out-of-sample predicted class probabilities with shape (N, K) for classification, (N,) for regression, or (N, K) for multilabel. Columns must be ordered by lexicographically sorted class names. Enables label issue detection.
features | Optional[npt.NDArray] | No | Feature embeddings with shape (N, D). Enables outlier, duplicate, and non-IID issue detection. Used to construct a knn_graph if one is not provided.
knn_graph | Optional[scipy.sparse.csr_matrix] | No | Precomputed k-nearest-neighbor distance graph as a sparse CSR matrix with shape (N, N). Non-zero entries represent distances. Takes precedence over features if both are provided.
issue_types | Optional[Dict[str, Any]] | No | Dictionary specifying which issue types to check and how to configure them. Keys are issue type names; values are dicts of keyword arguments for the corresponding IssueManager. If None, the default issue types for the configured task are used.
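The column-ordering requirement on pred_probs is easy to miss when a model emits classes in its own order. The sketch below shows one way to reorder columns so they follow lexicographically sorted class names; model_classes and raw_probs are made-up values for illustration. scikit-learn classifiers already expose classes_ in sorted order, so this step mainly matters for other frameworks.

```python
import numpy as np

# Hypothetical order in which a model emitted its probability columns.
model_classes = np.array(["dog", "cat", "bird"])
raw_probs = np.array([
    [0.7, 0.2, 0.1],  # columns follow model_classes: dog, cat, bird
    [0.1, 0.3, 0.6],
])

# Indices that put the class names into lexicographic order.
order = np.argsort(model_classes)
sorted_classes = model_classes[order]

# Reorder the columns so column j corresponds to sorted_classes[j].
pred_probs = raw_probs[:, order]
```

After the reordering, sorted_classes is bird, cat, dog and each column of pred_probs lines up with that order.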

Outputs

Name | Type | Description
return | None | The method mutates the Datalab instance in-place, populating self.data_issues with per-example issue flags, scores, and summary statistics. Access results via get_issues(), get_issue_summary(), or report().

Usage Examples

With Predicted Probabilities

from sklearn.linear_model import LogisticRegression
import numpy as np
from cleanlab import Datalab

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])
clf = LogisticRegression(random_state=0).fit(X, y)
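# In-sample probabilities keep this toy example short; for a real audit, use
# out-of-sample predictions (e.g. from cross-validation).
pred_probs = clf.predict_proba(X)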

lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(pred_probs=pred_probs)

With Feature Embeddings

import numpy as np
from cleanlab import Datalab

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])

lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(features=X)

With a Precomputed KNN Graph

from sklearn.neighbors import NearestNeighbors
import numpy as np
from cleanlab import Datalab

X = np.array([[0, 1], [1, 1], [2, 2], [2, 0]])
y = np.array([0, 1, 1, 0])

nbrs = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(X)
knn_graph = nbrs.kneighbors_graph(mode="distance")

lab = Datalab(data={"X": X, "y": y}, label_name="y")
lab.find_issues(knn_graph=knn_graph)

Selecting Specific Issue Types

from cleanlab import Datalab

# Assumes `lab` and `pred_probs` are defined as in "With Predicted Probabilities" above.
# Only check for label issues with custom configuration
issue_types = {
    "label": {
        "clean_learning_kwargs": {
            "prune_method": "prune_by_noise_rate",
        },
    },
}
lab.find_issues(pred_probs=pred_probs, issue_types=issue_types)
