Implementation:Cleanlab Cleanlab Identifier Column Detection
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Feature Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
IdentifierColumnIssueManager detects whether any feature columns in a dataset are identifier columns (such as auto-incremented row IDs or database primary keys) that should not be used as modeling features.
Description
The IdentifierColumnIssueManager class extends IssueManager with issue_name = "identifier_column". Unlike most issue managers in Datalab that operate on a per-row basis, this manager operates at the column level. It inspects each integer-typed feature column to determine whether its values form a contiguous integer sequence of the form {c, c+1, ..., c+n} where n equals the number of rows. If such a column is found, it indicates that the column is likely an identifier (e.g., row index, database primary key) rather than a meaningful feature.
Because this is a dataset-level issue rather than a per-example issue, all rows receive a score of 1.0 and is_identifier_column_issue is set to False for every row. The summary score is 0.0 if any identifier column is found, and 1.0 otherwise. The info dictionary records which column indices are identifier columns and how many were found.
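The column-level scan and the resulting outputs can be sketched roughly as follows. This is an illustration built only from the description above, not the library's code: the function name `summarize_identifier_columns` and its arguments are hypothetical, and the contiguity test (max minus min equals the number of unique values minus one) is one equivalent way to express the check for integers.

```python
from typing import Dict, List, Tuple

import numpy as np
import pandas as pd


def summarize_identifier_columns(
    columns: List[np.ndarray], num_rows: int
) -> Tuple[pd.DataFrame, float, Dict[str, object]]:
    """Hypothetical helper: scan columns, then build per-row, summary, and info outputs."""
    identifier_columns = []
    for idx, col in enumerate(columns):
        # Only integer-typed columns are candidates.
        if not np.issubdtype(col.dtype, np.integer):
            continue
        unique_vals = np.unique(col)  # sorted unique values
        # Sorted unique integers are contiguous iff max - min == count - 1.
        if unique_vals.size > 1 and unique_vals[-1] - unique_vals[0] == unique_vals.size - 1:
            identifier_columns.append(idx)

    # Dataset-level issue: every row gets the same non-flagged values.
    issues = pd.DataFrame(
        {
            "is_identifier_column_issue": np.full(num_rows, False),
            "identifier_column_score": np.full(num_rows, 1.0),
        }
    )
    summary_score = 0.0 if identifier_columns else 1.0
    info = {
        "identifier_columns": identifier_columns,
        "num_identifier_columns": len(identifier_columns),
    }
    return issues, summary_score, info
```

Keeping the per-row frame constant means the signal lives entirely in the summary score and the info dictionary, which matches the dataset-level framing of this check.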
Usage
Use IdentifierColumnIssueManager when auditing a dataset before model training to ensure that no identifier columns have leaked into the feature set. This is particularly important when working with data exported from databases where primary key columns may inadvertently be included. Such columns can cause models to memorize row orderings rather than learn genuine patterns, leading to poor generalization.
Code Reference
Source Location
- Repository: Cleanlab
- File: cleanlab/datalab/internal/issue_manager/identifier_column.py
- Lines: 1-131
Signature
class IdentifierColumnIssueManager(IssueManager):
    description: ClassVar[str] = """Checks whether there is an identifier_column in the features of a dataset..."""
    issue_name: ClassVar[str] = "identifier_column"
    verbosity_levels = {0: [], 1: ["identifier_columns"], 2: []}

    def _is_sequential(self, arr: npt.NDArray) -> bool: ...

    def _prepare_features(
        self, features: Optional[Union[npt.NDArray, pd.DataFrame, list, dict]]
    ) -> Union[npt.NDArray, List[npt.NDArray]]: ...

    def find_issues(
        self, features: Optional[Union[npt.NDArray, pd.DataFrame, list, dict]], **kwargs
    ) -> None: ...
Import
from cleanlab.datalab.internal.issue_manager.identifier_column import IdentifierColumnIssueManager
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| features | Optional[Union[npt.NDArray, pd.DataFrame, list, dict]] | Yes | The dataset features to check for identifier columns. Accepts NumPy arrays, pandas DataFrames, lists of lists/arrays, or dictionaries of column values. |
Outputs
| Name | Type | Description |
|---|---|---|
| self.issues | pd.DataFrame | DataFrame with is_identifier_column_issue (always False) and identifier_column_score (always 1.0) per row, since this is a dataset-level issue. |
| self.summary | pd.DataFrame | Summary DataFrame with a score of 0.0 if any identifier column is found, 1.0 otherwise. |
| self.info | dict | Dictionary containing identifier_columns (list of column indices) and num_identifier_columns (count of identifier columns). |
Internal Methods
_is_sequential
Checks if the elements of an array form a contiguous integer sequence. It sorts the unique values, computes the expected range from minimum to maximum, and verifies all values match. Returns False for empty arrays or arrays with a single unique value.
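A minimal NumPy sketch of this check, written from the description rather than copied from the library, might look like the following. The standalone function name `is_sequential` is hypothetical (the real method is a private instance method).

```python
import numpy as np


def is_sequential(arr: np.ndarray) -> bool:
    """Return True if the unique values of `arr` form a contiguous integer range."""
    if arr.size == 0:
        return False  # empty arrays are never treated as sequential
    unique_sorted = np.unique(arr)  # np.unique returns sorted unique values
    min_val, max_val = unique_sorted[0], unique_sorted[-1]
    if min_val == max_val:
        return False  # a single unique value is a constant column, not an ID
    # Expected range from minimum to maximum, inclusive.
    expected = np.arange(min_val, max_val + 1)
    return bool(np.array_equal(unique_sorted, expected))
```

Because this sketch compares only unique values, row order does not matter, and a column with repeated values such as `[1, 1, 2, 3]` would still pass.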
_prepare_features
Normalizes various input formats into a list of per-column NumPy arrays. For NumPy arrays, it transposes rows to columns. For DataFrames, it extracts each column while preserving string dtype. For dicts, it converts each value list to an array. For lists, it validates that each element is a list or array.
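The normalization can be approximated with the sketch below. Several details are assumptions not stated above: the function name `prepare_features` is hypothetical, NumPy input is assumed to be 2-D, each element of a list input is treated as one column, invalid input raises `ValueError`, and the DataFrame branch does not try to reproduce the library's string dtype handling.

```python
from typing import List

import numpy as np
import pandas as pd


def prepare_features(features) -> List[np.ndarray]:
    """Hypothetical helper: normalize supported inputs to a list of per-column arrays."""
    if isinstance(features, np.ndarray):
        # Transpose so that iterating yields columns instead of rows (2-D input assumed).
        return list(features.T)
    if isinstance(features, pd.DataFrame):
        return [features[name].to_numpy() for name in features.columns]
    if isinstance(features, dict):
        return [np.asarray(values) for values in features.values()]
    if isinstance(features, list):
        for column in features:
            if not isinstance(column, (list, np.ndarray)):
                raise ValueError("Each element of a list input must be a list or array.")
        # Assumption: each list element is one column.
        return [np.asarray(column) for column in features]
    raise ValueError(f"Unsupported features type: {type(features).__name__}")
```

Returning one array per column lets mixed-type inputs (for example, a DataFrame with integer and string columns) keep a separate dtype for each column, which a single 2-D array would not allow.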
Usage Examples
Basic Usage
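import numpy as np
from cleanlab import Datalab

# Suppose your dataset has an ID column that is sequential
data = {
    "id": [0, 1, 2, 3, 4],
    "feature_a": [1.2, 3.4, 5.6, 7.8, 9.0],
    "label": ["cat", "dog", "cat", "dog", "cat"],
}

# The identifier-column check inspects the `features` argument and only considers
# integer-typed columns, so build an integer feature matrix here
# (feature_a is rounded purely for illustration).
features = np.column_stack([data["id"], np.round(data["feature_a"]).astype(int)])

# With features supplied, the IdentifierColumnIssueManager can flag column 0 ("id"),
# provided the identifier_column check is among the issue types Datalab runs.
lab = Datalab(data=data, label_name="label")
lab.find_issues(features=features)
lab.report()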