Implementation:Cleanlab Cleanlab Multilabel Find Label Issues
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Data Quality, Multi-Label Classification |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The find_label_issues and find_multilabel_issues_per_class functions identify potentially mislabeled examples in multi-label classification datasets by decomposing the problem into independent binary classification subproblems per class.
Description
This module in cleanlab.multilabel_classification.filter provides two public functions:
- find_label_issues: Identifies examples where any class appears to be incorrectly annotated. It delegates to the internal _find_label_issues_multilabel function from cleanlab.filter. Returns either a boolean mask (where True indicates a label issue) or an array of indices sorted by likelihood of mislabeling. Supports a low_memory mode that uses batched label issue detection for large datasets.
- find_multilabel_issues_per_class: Provides finer-grained analysis by determining which specific classes are incorrectly annotated for each example. For each of the K classes, it constructs a binary (one-vs-rest) label vector and a two-column predicted probability matrix holding the complement 1 - p_k alongside the class-k probability p_k, then runs standard binary cleanlab.filter.find_label_issues on each subproblem. Can return boolean masks per class, or ranked index lists together with the per-class binary labels and predicted probabilities.
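The one-vs-rest decomposition described above can be sketched in plain NumPy. This is a minimal illustration, not cleanlab's internal code: the helper name `one_vs_rest_subproblem` is hypothetical, and the column order (complement first, class probability second) is an assumption chosen so that column index matches the binary label value.

```python
import numpy as np


def one_vs_rest_subproblem(labels, pred_probs, k):
    """Hypothetical helper: build the binary subproblem for class k."""
    # Binary given label: 1 if class k is among the example's annotated classes.
    binary_labels = np.array([int(k in classes) for classes in labels])
    # Two-column probabilities: column 0 = P(not k), column 1 = P(k).
    p_k = pred_probs[:, k]
    binary_pred_probs = np.column_stack([1.0 - p_k, p_k])
    return binary_labels, binary_pred_probs


labels = [[0, 1], [1], [0, 2], [2], [0, 1, 2]]
pred_probs = np.array([
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.7],
    [0.1, 0.2, 0.9],
    [0.7, 0.8, 0.6],
])

binary_labels, binary_pred_probs = one_vs_rest_subproblem(labels, pred_probs, k=2)
print(binary_labels)            # [0 0 1 1 1]
print(binary_pred_probs.shape)  # (5, 2)
```

Each binary subproblem built this way has rows that sum to 1, which is the format a standard binary label-issue detector expects.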
Both functions accept a confident_joint parameter in the (K, 2, 2) one-vs-rest format: one 2x2 confident joint per class, each estimating the joint distribution of noisy and true binary labels for that class independently of the others.
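To make the (K, 2, 2) layout concrete, the sketch below builds a per-class 2x2 table of counts. It is deliberately a simplified stand-in and not cleanlab's confident joint estimator: the "estimated true label" here is just the prediction thresholded at 0.5, and the helper name `naive_one_vs_rest_counts` is hypothetical. The axis convention (row = given label, column = estimated label) is assumed to mirror the multi-class confident joint.

```python
import numpy as np


def naive_one_vs_rest_counts(labels, pred_probs, threshold=0.5):
    """Hypothetical helper: per-class 2x2 counts of (given label, thresholded prediction).

    NOT cleanlab's confident joint; it only illustrates the (K, 2, 2) shape.
    """
    n, k = pred_probs.shape
    # Multi-hot encoding of the given labels, shape (N, K).
    given = np.zeros((n, k), dtype=int)
    for i, classes in enumerate(labels):
        for c in classes:
            given[i, c] = 1
    # Crude stand-in for the estimated true label.
    guessed = (pred_probs >= threshold).astype(int)
    counts = np.zeros((k, 2, 2), dtype=int)
    for c in range(k):
        # np.add.at accumulates repeated (row, column) pairs correctly.
        np.add.at(counts[c], (given[:, c], guessed[:, c]), 1)
    return counts


labels = [[0, 1], [1], [0, 2], [2], [0, 1, 2]]
pred_probs = np.array([
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.7],
    [0.1, 0.2, 0.9],
    [0.7, 0.8, 0.6],
])

counts = naive_one_vs_rest_counts(labels, pred_probs)
print(counts.shape)  # (3, 2, 2)
```

Off-diagonal entries of each 2x2 slice correspond to examples whose given label for that class disagrees with the estimate, which is where label issues are expected to concentrate.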
Usage
Import find_label_issues when you need a single boolean mask or ranked list identifying which examples have any mislabeled class. Import find_multilabel_issues_per_class when you need to know exactly which classes are mislabeled for each example, which is useful for targeted re-annotation or for feeding into dataset summary functions.
Code Reference
Source Location
- Repository: Cleanlab
- File: cleanlab/multilabel_classification/filter.py
- Lines: 1-303
Signature
def find_label_issues(
    labels: list,
    pred_probs: np.ndarray,
    return_indices_ranked_by: Optional[str] = None,
    rank_by_kwargs={},
    filter_by: str = "prune_by_noise_rate",
    frac_noise: float = 1.0,
    num_to_remove_per_class: Optional[List[int]] = None,
    min_examples_per_class=1,
    confident_joint: Optional[np.ndarray] = None,
    n_jobs: Optional[int] = None,
    verbose: bool = False,
    low_memory: bool = False,
) -> np.ndarray

def find_multilabel_issues_per_class(
    labels: list,
    pred_probs: np.ndarray,
    return_indices_ranked_by: Optional[str] = None,
    rank_by_kwargs={},
    filter_by: str = "prune_by_noise_rate",
    frac_noise: float = 1.0,
    num_to_remove_per_class: Optional[List[int]] = None,
    min_examples_per_class=1,
    confident_joint: Optional[np.ndarray] = None,
    n_jobs: Optional[int] = None,
    verbose: bool = False,
    low_memory: bool = False,
) -> Union[np.ndarray, Tuple[List[np.ndarray], List[Any], List[np.ndarray]]]
Import
from cleanlab.multilabel_classification.filter import find_label_issues
from cleanlab.multilabel_classification.filter import find_multilabel_issues_per_class
I/O Contract
Inputs (find_label_issues)
| Name | Type | Required | Description |
|---|---|---|---|
| labels | List[List[int]] | Yes | List of noisy labels where each element is a list of class indices the example belongs to (e.g. [[1,2],[1],[0],...]). |
| pred_probs | np.ndarray (N, K) | Yes | Model-predicted class probabilities. Rows need not sum to 1, because an example can belong to several classes at once. Should ideally be out-of-sample predictions from cross-validation. |
| return_indices_ranked_by | str or None | No | If None, returns a boolean mask. Otherwise one of: 'self_confidence', 'normalized_margin', 'confidence_weighted_entropy'. Returns sorted indices. |
| rank_by_kwargs | dict | No | Extra keyword arguments for the ranking scoring function. |
| filter_by | str | No | Confident learning method for filtering: 'prune_by_noise_rate' (default), 'prune_by_class', 'both', 'confident_learning', 'predicted_neq_given', 'low_normalized_margin', 'low_self_confidence'. |
| frac_noise | float | No | Fraction of estimated label errors to return (default 1.0 = all). |
| num_to_remove_per_class | List[int] | No | Number of mislabeled examples to return per class. |
| min_examples_per_class | int | No | Minimum number of examples per class that are left unflagged, which avoids pruning rare classes too aggressively (default 1). |
| confident_joint | np.ndarray (K, 2, 2) | No | One-vs-rest confident joint. Auto-computed if not provided. |
| n_jobs | int | No | Number of parallel processing threads. |
| verbose | bool | No | If True, prints multiprocessing info. |
| low_memory | bool | No | If True, uses batched detection for large datasets. |
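Before calling either function it can help to confirm that the inputs match the contract in the table above. The following is a minimal pre-flight check under stated assumptions: `check_multilabel_inputs` is a hypothetical helper, not part of cleanlab, and it only verifies shapes, the probability range, and that class indices fall in [0, K).

```python
import numpy as np


def check_multilabel_inputs(labels, pred_probs):
    """Hypothetical pre-flight check; not part of cleanlab."""
    pred_probs = np.asarray(pred_probs, dtype=float)
    if pred_probs.ndim != 2:
        raise ValueError("pred_probs must be 2-D with shape (N, K)")
    n, k = pred_probs.shape
    if len(labels) != n:
        raise ValueError(f"labels has {len(labels)} entries but pred_probs has {n} rows")
    if np.any((pred_probs < 0) | (pred_probs > 1)):
        raise ValueError("pred_probs entries must lie in [0, 1]")
    for i, classes in enumerate(labels):
        for c in classes:
            if not 0 <= c < k:
                raise ValueError(f"example {i} has class index {c} outside [0, {k})")
    return n, k


labels = [[0, 1], [1], [0, 2], [2], [0, 1, 2]]
pred_probs = np.array([
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.7],
    [0.1, 0.2, 0.9],
    [0.7, 0.8, 0.6],
])

shape = check_multilabel_inputs(labels, pred_probs)
print(shape)  # (5, 3)
```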
Outputs (find_label_issues)
| Name | Type | Description |
|---|---|---|
| label_issues | np.ndarray | If return_indices_ranked_by is None: boolean mask of shape (N,) where True indicates a label issue. Otherwise: array of indices sorted by likelihood of mislabeling. |
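The two return forms are interchangeable for most downstream uses. The sketch below uses hand-written stand-in arrays rather than real function output (the values of `issue_mask` and `ranked_indices` are made up for illustration) to show how a boolean mask filters a dataset and how a ranked index array converts back to a mask.

```python
import numpy as np

labels = [[0, 1], [1], [0, 2], [2], [0, 1, 2]]
pred_probs = np.array([
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.7],
    [0.1, 0.2, 0.9],
    [0.7, 0.8, 0.6],
])

# Stand-in for the boolean-mask output (illustrative values only).
issue_mask = np.array([False, True, False, False, True])

# Drop flagged examples; labels is a ragged list, so filter it with zip.
clean_labels = [lab for lab, bad in zip(labels, issue_mask) if not bad]
clean_pred_probs = pred_probs[~issue_mask]

# Stand-in for the ranked-index output (illustrative values only).
ranked_indices = np.array([4, 1])

# Convert ranked indices back to a boolean mask of shape (N,).
mask_from_indices = np.zeros(len(labels), dtype=bool)
mask_from_indices[ranked_indices] = True
```

The mask form is convenient for filtering, while the ranked form preserves ordering, so it suits review queues where the most suspicious examples are inspected first.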
Inputs (find_multilabel_issues_per_class)
Parameters are identical to find_label_issues above.
Outputs (find_multilabel_issues_per_class)
| Name | Type | Description |
|---|---|---|
| per_class_label_issues | np.ndarray or Tuple | If return_indices_ranked_by is None: boolean array of shape (N, K) where True at position (i, k) means class k is mislabeled for example i. Otherwise: a tuple (label_issues_list, labels_list, pred_probs_list), where each element is a list of length K with one entry per class. |
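An (N, K) boolean array like the one described above can be summarized in several ways. The sketch below uses a hand-written stand-in array (illustrative values, not real output). Note that collapsing it with `any(axis=1)` gives a per-example mask, but it is an assumption, not a guarantee, that this matches what find_label_issues would return for the same data.

```python
import numpy as np

# Stand-in for the (N, K) boolean output (illustrative values only).
per_class_issues = np.array([
    [False, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, False],
    [True, False, True],
])

# Number of flagged examples for each class.
issues_per_class = per_class_issues.sum(axis=0)

# Examples with at least one flagged class.
any_issue = per_class_issues.any(axis=1)

# (example, class) pairs to send for targeted re-annotation.
example_idx, class_idx = np.nonzero(per_class_issues)
pairs = list(zip(example_idx.tolist(), class_idx.tolist()))
print(pairs)  # [(1, 1), (4, 0), (4, 2)]
```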
Usage Examples
Basic Usage: Get Boolean Mask
from cleanlab.multilabel_classification.filter import find_label_issues
import numpy as np
labels = [[0, 1], [1], [0, 2], [2], [0, 1, 2]]
pred_probs = np.array([
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.1],
    [0.8, 0.1, 0.7],
    [0.1, 0.2, 0.9],
    [0.7, 0.8, 0.6],
])
# Returns boolean mask: True = label issue for any class
issue_mask = find_label_issues(labels=labels, pred_probs=pred_probs)
print("Examples with issues:", np.where(issue_mask)[0])
Per-Class Analysis
import numpy as np
from cleanlab.multilabel_classification.filter import find_multilabel_issues_per_class
# Reuses labels and pred_probs from the previous example.
# Returns (N, K) boolean array showing which specific classes have issues
per_class_issues = find_multilabel_issues_per_class(labels=labels, pred_probs=pred_probs)
for k in range(pred_probs.shape[1]):
    print(f"Class {k} issues in examples:", np.where(per_class_issues[:, k])[0])
Ranked Indices
# Get indices ranked by self_confidence (most likely mislabeled first)
ranked_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print("Issue indices (ranked):", ranked_issues)
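For intuition about what ranking by self_confidence means, the sketch below ranks flagged examples within a single binary subproblem, where self-confidence is the predicted probability of the given label. It is a simplified illustration: `rank_by_self_confidence` is a hypothetical helper, the inputs are made up, and cleanlab's multi-label ranking aggregates scores across classes, so its ordering may differ.

```python
import numpy as np


def rank_by_self_confidence(binary_labels, binary_pred_probs, issue_mask):
    """Hypothetical helper: order flagged examples from least to most self-confident."""
    # Self-confidence: predicted probability assigned to the given label.
    self_conf = binary_pred_probs[np.arange(len(binary_labels)), binary_labels]
    flagged = np.flatnonzero(issue_mask)
    # Lowest self-confidence first, i.e. most likely mislabeled first.
    return flagged[np.argsort(self_conf[flagged], kind="stable")]


# Illustrative binary subproblem for one class.
binary_labels = np.array([1, 0, 1, 0])
p_class = np.array([0.9, 0.7, 0.2, 0.1])
binary_pred_probs = np.column_stack([1.0 - p_class, p_class])
issue_mask = np.array([False, True, True, False])

ranked = rank_by_self_confidence(binary_labels, binary_pred_probs, issue_mask)
print(ranked)  # [2 1]
```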