Implementation:Cleanlab Cleanlab Identifier Column Detection

From Leeroopedia


Knowledge Sources
Domains Data Quality, Feature Engineering
Last Updated 2026-02-09 00:00 GMT

Overview

IdentifierColumnIssueManager detects whether any feature columns in a dataset are identifier columns (such as auto-incremented row IDs or database primary keys) that should not be used as modeling features.

Description

The IdentifierColumnIssueManager class extends IssueManager with issue_name = "identifier_column". Unlike most issue managers in Datalab, which operate on a per-row basis, this manager operates at the column level. It inspects each integer-typed feature column to determine whether its values form a contiguous integer sequence of the form {c, c+1, ..., c+n-1}, where n is the number of rows. Such a column is likely an identifier (e.g., a row index or database primary key) rather than a meaningful feature.

Because this is a dataset-level issue rather than a per-example issue, all rows receive a score of 1.0 and is_identifier_column_issue is set to False for every row. The summary score is 0.0 if any identifier column is found, and 1.0 otherwise. The info dictionary records which column indices are identifier columns and how many were found.

Usage

Use IdentifierColumnIssueManager when auditing a dataset before model training to ensure that no identifier columns have leaked into the feature set. This is particularly important when working with data exported from databases where primary key columns may inadvertently be included. Such columns can cause models to memorize row orderings rather than learn genuine patterns, leading to poor generalization.

Code Reference

Source Location

  • Repository: Cleanlab
  • File: cleanlab/datalab/internal/issue_manager/identifier_column.py
  • Lines: 1-131

Signature

class IdentifierColumnIssueManager(IssueManager):
    description: ClassVar[str] = """Checks whether there is an identifier_column in the features of a dataset..."""
    issue_name: ClassVar[str] = "identifier_column"
    verbosity_levels = {0: [], 1: ["identifier_columns"], 2: []}

    def _is_sequential(self, arr: npt.NDArray) -> bool: ...
    def _prepare_features(
        self, features: Optional[Union[npt.NDArray, pd.DataFrame, list, dict]]
    ) -> Union[npt.NDArray, List[npt.NDArray]]: ...
    def find_issues(
        self, features: Optional[Union[npt.NDArray, pd.DataFrame, list, dict]], **kwargs
    ) -> None: ...

Import

from cleanlab.datalab.internal.issue_manager.identifier_column import IdentifierColumnIssueManager

I/O Contract

Inputs

Name Type Required Description
features Optional[Union[npt.NDArray, pd.DataFrame, list, dict]] Yes The dataset features to check for identifier columns. Accepts NumPy arrays, pandas DataFrames, lists of lists/arrays, or dictionaries of column values.

Outputs

Name Type Description
self.issues pd.DataFrame DataFrame with is_identifier_column_issue (always False) and identifier_column_score (always 1.0) per row, since this is a dataset-level issue.
self.summary pd.DataFrame Summary DataFrame with a score of 0.0 if any identifier column is found, 1.0 otherwise.
self.info dict Dictionary containing identifier_columns (list of column indices) and num_identifier_columns (count of identifier columns).
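
The three outputs above can be sketched as follows. This is a minimal illustration of the described contract, not the library's code: the helper name build_outputs_sketch, its arguments, and the summary column names "issue_type" and "score" are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def build_outputs_sketch(columns, num_rows):
    """Assemble issues, summary, and info from a list of per-column arrays.

    Hypothetical helper: the name, arguments, and summary column names are
    assumptions for illustration only.
    """

    def looks_like_identifier(col):
        col = np.asarray(col)
        if not np.issubdtype(col.dtype, np.integer):
            return False  # only integer-typed columns are inspected
        unique_sorted = np.unique(col)
        if unique_sorted.size < 2:
            return False  # empty or constant columns are not identifiers
        expected = np.arange(unique_sorted[0], unique_sorted[-1] + 1)
        return bool(np.array_equal(unique_sorted, expected))

    identifier_columns = [
        i for i, col in enumerate(columns) if looks_like_identifier(col)
    ]

    # Dataset-level issue: no individual row is flagged.
    issues = pd.DataFrame(
        {
            "is_identifier_column_issue": [False] * num_rows,
            "identifier_column_score": [1.0] * num_rows,
        }
    )
    score = 0.0 if identifier_columns else 1.0
    summary = pd.DataFrame({"issue_type": ["identifier_column"], "score": [score]})
    info = {
        "identifier_columns": identifier_columns,
        "num_identifier_columns": len(identifier_columns),
    }
    return issues, summary, info
```

With one sequential integer column and one float column, the sketch reports column index 0 in info and a summary score of 0.0, while every row in issues keeps a score of 1.0.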

Internal Methods

_is_sequential

Checks if the elements of an array form a contiguous integer sequence. It sorts the unique values, computes the expected range from minimum to maximum, and verifies all values match. Returns False for empty arrays or arrays with a single unique value.
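
The check described above can be sketched in a few lines of NumPy. This is an illustration written from the description only, not a copy of the library's source; the function name is_sequential_sketch is hypothetical.

```python
import numpy as np


def is_sequential_sketch(arr: np.ndarray) -> bool:
    """Return True if the unique values of `arr` form a gap-free integer run.

    Hypothetical stand-in for the described behavior of _is_sequential.
    """
    if arr.size == 0:
        return False  # empty column: nothing to check
    unique_sorted = np.unique(arr)  # np.unique returns the sorted unique values
    if unique_sorted.size == 1:
        return False  # a constant column is not treated as an identifier
    # Expected run from the minimum to the maximum, inclusive.
    expected = np.arange(unique_sorted[0], unique_sorted[-1] + 1)
    return bool(np.array_equal(unique_sorted, expected))
```

Because the comparison is made on sorted unique values, the order of the rows does not affect the result in this sketch: a shuffled ID column is still recognized.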

_prepare_features

Normalizes various input formats into a list of per-column NumPy arrays. For NumPy arrays, it transposes rows to columns. For DataFrames, it extracts each column while preserving string dtype. For dicts, it converts each value list to an array. For lists, it validates that each element is a list or array.
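
A rough sketch of this normalization is shown below. It is written from the description above rather than from the library's source. The name prepare_features_sketch is hypothetical, the sketch always returns a list (the real signature also allows a plain array), and it assumes that each element of a list input is one column.

```python
from typing import List, Union

import numpy as np
import pandas as pd


def prepare_features_sketch(
    features: Union[np.ndarray, pd.DataFrame, list, dict]
) -> List[np.ndarray]:
    """Normalize supported inputs into a list of per-column arrays.

    Hypothetical stand-in for the described behavior of _prepare_features.
    """
    if isinstance(features, np.ndarray):
        # Iterating over the transpose yields one array per column.
        return [col for col in features.T]
    if isinstance(features, pd.DataFrame):
        # Each column keeps its own dtype (strings stay as object arrays).
        return [features[name].to_numpy() for name in features.columns]
    if isinstance(features, dict):
        return [np.asarray(values) for values in features.values()]
    if isinstance(features, list):
        # Assumption: each element of the list is one column.
        for col in features:
            if not isinstance(col, (list, np.ndarray)):
                raise ValueError("each element of a list input must be a list or array")
        return [np.asarray(col) for col in features]
    raise ValueError(f"unsupported features type: {type(features).__name__}")
```

Working column by column lets mixed-type inputs such as DataFrames keep an integer dtype on ID-like columns, which a single 2-D float array would lose.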

Usage Examples

Basic Usage

import numpy as np
from cleanlab import Datalab

# Suppose your dataset has an ID column that is sequential
data = {
    "id": [0, 1, 2, 3, 4],
    "feature_a": [1.2, 3.4, 5.6, 7.8, 9.0],
    "label": ["cat", "dog", "cat", "dog", "cat"],
}

# Datalab runs the identifier-column check on the feature columns it is given
# (the `features` argument of find_issues), so "id" can be flagged as an
# identifier column only if it is part of those features.
lab = Datalab(data=data, label_name="label")
lab.find_issues()
lab.report()
