Implementation:Cleanlab Cleanlab Identifier Column Detection

Knowledge Sources	Cleanlab
Domains	Data Quality, Feature Engineering
Last Updated	2026-02-09 00:00 GMT

Overview

IdentifierColumnIssueManager detects whether any feature columns in a dataset are identifier columns (such as auto-incremented row IDs or database primary keys) that should not be used as modeling features.

Description

The IdentifierColumnIssueManager class extends IssueManager with issue_name = "identifier_column". Unlike most issue managers in Datalab, which operate on a per-row basis, this manager operates at the column level. It inspects each integer-typed feature column to determine whether its values form a contiguous integer sequence of the form {c, c+1, ..., c+n-1}, where n is the number of rows. Such a column is likely an identifier (e.g., a row index or database primary key) rather than a meaningful feature.

Because this is a dataset-level issue rather than a per-example issue, all rows receive a score of 1.0 and is_identifier_column_issue is set to False for every row. The summary score is 0.0 if any identifier column is found, and 1.0 otherwise. The info dictionary records which column indices are identifier columns and how many were found.
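The output convention described above can be illustrated with a minimal sketch. The helper `build_outputs` is hypothetical and is not cleanlab's code, and the summary column names `issue_type` and `score` are assumptions; only the values (all rows `False` with score 1.0, summary score 0.0 or 1.0, and the two info keys) come from the description.

```python
from typing import Dict, List, Tuple

import pandas as pd


def build_outputs(
    num_rows: int, identifier_columns: List[int]
) -> Tuple[pd.DataFrame, pd.DataFrame, Dict[str, object]]:
    """Hypothetical helper mirroring the outputs described above."""
    # Dataset-level issue: no individual row is flagged, and every row scores 1.0.
    issues = pd.DataFrame(
        {
            "is_identifier_column_issue": [False] * num_rows,
            "identifier_column_score": [1.0] * num_rows,
        }
    )
    # The summary score is 0.0 if any identifier column was found, 1.0 otherwise.
    score = 0.0 if identifier_columns else 1.0
    # Column names here are an assumption made for illustration.
    summary = pd.DataFrame({"issue_type": ["identifier_column"], "score": [score]})
    info = {
        "identifier_columns": identifier_columns,
        "num_identifier_columns": len(identifier_columns),
    }
    return issues, summary, info


# Example: a 5-row dataset whose column 0 was detected as an identifier.
issues, summary, info = build_outputs(num_rows=5, identifier_columns=[0])
```

Because the per-row table never flags anything, the column-level finding has to be read from the summary score and the info dictionary.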

Usage

Use IdentifierColumnIssueManager when auditing a dataset before model training to ensure that no identifier columns have leaked into the feature set. This is particularly important when working with data exported from databases where primary key columns may inadvertently be included. Such columns can cause models to memorize row orderings rather than learn genuine patterns, leading to poor generalization.

Code Reference

Source Location

Repository: Cleanlab
File: cleanlab/datalab/internal/issue_manager/identifier_column.py
Lines: 1-131

Signature

class IdentifierColumnIssueManager(IssueManager):
    description: ClassVar[str] = """Checks whether there is an identifier_column in the features of a dataset..."""
    issue_name: ClassVar[str] = "identifier_column"
    verbosity_levels = {0: [], 1: ["identifier_columns"], 2: []}

    def _is_sequential(self, arr: npt.NDArray) -> bool: ...
    def _prepare_features(
        self, features: Optional[Union[npt.NDArray, pd.DataFrame, list, dict]]
    ) -> Union[npt.NDArray, List[npt.NDArray]]: ...
    def find_issues(
        self, features: Optional[Union[npt.NDArray, pd.DataFrame, list, dict]], **kwargs
    ) -> None: ...

Import

from cleanlab.datalab.internal.issue_manager.identifier_column import IdentifierColumnIssueManager

I/O Contract

Inputs

Name	Type	Required	Description
features	Optional[Union[npt.NDArray, pd.DataFrame, list, dict]]	Yes	The dataset features to check for identifier columns. Accepts NumPy arrays, pandas DataFrames, lists of lists/arrays, or dictionaries of column values.

Outputs

Name	Type	Description
self.issues	`pd.DataFrame`	DataFrame with `is_identifier_column_issue` (always `False`) and `identifier_column_score` (always 1.0) per row, since this is a dataset-level issue.
self.summary	`pd.DataFrame`	Summary DataFrame with a score of 0.0 if any identifier column is found, 1.0 otherwise.
self.info	`dict`	Dictionary containing `identifier_columns` (list of column indices) and `num_identifier_columns` (count of identifier columns).

Internal Methods

_is_sequential

Checks if the elements of an array form a contiguous integer sequence. It sorts the unique values, computes the expected range from minimum to maximum, and verifies all values match. Returns False for empty arrays or arrays with a single unique value.
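The steps described here can be sketched as follows. This is a reconstruction from the description rather than the library's source: the standalone function `is_sequential` and the helper `find_identifier_columns` are hypothetical names, and the integer-dtype filter reflects the statement that only integer-typed columns are inspected.

```python
from typing import List

import numpy as np


def is_sequential(arr: np.ndarray) -> bool:
    """Return True if the unique values of arr form a contiguous integer run."""
    if arr.size == 0:
        return False
    unique_sorted = np.unique(arr)  # np.unique returns sorted unique values
    if unique_sorted.size == 1:
        return False
    # Expected range from the minimum to the maximum, inclusive.
    expected = np.arange(unique_sorted[0], unique_sorted[-1] + 1)
    return bool(np.array_equal(unique_sorted, expected))


def find_identifier_columns(columns: List[np.ndarray]) -> List[int]:
    """Hypothetical scan: indices of integer-typed columns that are sequential."""
    return [
        i
        for i, col in enumerate(columns)
        if np.issubdtype(col.dtype, np.integer) and is_sequential(col)
    ]


columns = [
    np.array([3, 1, 2, 0, 4]),            # shuffled row IDs, still contiguous
    np.array([1.2, 3.4, 5.6, 7.8, 9.0]),  # float feature, skipped by the dtype filter
    np.array([5, 5, 9, 1, 2]),            # integers with gaps
]
print(find_identifier_columns(columns))  # [0]
```

Since the comparison is made on sorted unique values, the order in which the IDs appear in the column does not matter in this sketch.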

_prepare_features

Normalizes various input formats into a list of per-column NumPy arrays. For NumPy arrays, it transposes rows to columns. For DataFrames, it extracts each column while preserving string dtype. For dicts, it converts each value list to an array. For lists, it validates that each element is a list or array.
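A rough sketch of this normalization is shown below. It is not the library's implementation: the function name `prepare_features` is hypothetical, NumPy input is assumed to be 2-D, each element of a list input is assumed to represent one column, and per-column `to_numpy()` is used as one way to keep each DataFrame column's own dtype instead of upcasting the whole frame.

```python
from typing import List

import numpy as np
import pandas as pd


def prepare_features(features) -> List[np.ndarray]:
    """Hypothetical normalizer: return one NumPy array per feature column."""
    if isinstance(features, np.ndarray):
        # Assumes a 2-D array of shape (rows, columns); iterating over the
        # transpose yields one 1-D array per column.
        return list(features.T)
    if isinstance(features, pd.DataFrame):
        # Extract columns one at a time so an integer column stays integer
        # even when other columns hold floats or strings.
        return [features[name].to_numpy() for name in features.columns]
    if isinstance(features, dict):
        return [np.asarray(values) for values in features.values()]
    if isinstance(features, list):
        # Assumption: each element of the list is one column.
        for col in features:
            if not isinstance(col, (list, np.ndarray)):
                raise ValueError("each element must be a list or a NumPy array")
        return [np.asarray(col) for col in features]
    raise ValueError(f"unsupported features type: {type(features).__name__}")


df = pd.DataFrame({"id": [0, 1, 2], "name": ["a", "b", "c"]})
df_columns = prepare_features(df)
array_columns = prepare_features(np.array([[0, 10], [1, 20], [2, 30]]))
dict_columns = prepare_features({"id": [0, 1, 2], "x": [0.5, 1.5, 2.5]})
```

Working with a list of per-column arrays lets a column-level check examine each column's dtype independently, which matters when integer and non-integer columns are mixed.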

Usage Examples

Basic Usage

from cleanlab import Datalab

# Suppose your dataset has an ID column that is sequential
data = {
    "id": [0, 1, 2, 3, 4],
    "feature_a": [1.2, 3.4, 5.6, 7.8, 9.0],
    "label": ["cat", "dog", "cat", "dog", "cat"],
}

# When Datalab runs its suite of issue checks, the IdentifierColumnIssueManager
# will flag the "id" column as an identifier column if it appears in the features.
lab = Datalab(data=data, label_name="label")
lab.find_issues()
lab.report()

Related Pages

Principle:Cleanlab_Cleanlab_Identifier_Column_Issue_Detection
