Principle:Cleanlab Cleanlab Identifier Column Issue Detection
| Knowledge Sources | |
|---|---|
| Domains | Data Quality, Feature Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Identifier column issue detection is the practice of identifying feature columns that contain sequential identifiers (such as row indices or database primary keys) rather than meaningful predictive information, preventing models from learning spurious orderings.
Description
In many real-world datasets, especially those exported from databases or spreadsheets, identifier columns such as auto-incremented primary keys, row indices, or unique record IDs can inadvertently be included among the feature columns used for model training. These columns encode the ordering of data collection rather than any genuine relationship with the target variable.
When a model is trained on such features, it may learn to associate specific identifier ranges with specific labels, a form of data leakage that produces artificially high training accuracy but poor generalization to new data. This is particularly insidious because:
- The identifier column may have a spurious correlation with the label if data was collected in a non-random order (e.g., all positive examples first).
- Even without correlation, the model wastes capacity on a non-informative feature.
- Standard validation may not reveal the issue when the test set identifiers fall within the training range, as happens when one export is split randomly into training and test sets.
Identifier column detection is therefore a critical data quality check that should be performed before any modeling work begins.
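The ordered-collection failure mode described above can be illustrated with a small toy example. This is only a sketch: the data, the `predict_from_id` rule, and the threshold of 50 are hypothetical and stand in for whatever a real model might learn from an identifier feature.

```python
# Hypothetical toy data: 100 rows collected with all positive examples first,
# so the row identifier happens to separate the labels perfectly.
n = 100
row_id = list(range(n))
label = [1] * 50 + [0] * 50


def predict_from_id(i, threshold=50):
    """A stand-in 'model' that has learned only the collection order."""
    return 1 if i < threshold else 0


# Accuracy on the data the rule was derived from.
train_acc = sum(predict_from_id(i) == y for i, y in zip(row_id, label)) / n

# New records receive identifiers beyond the training range; their labels
# (balanced here, by assumption) have no relationship to the identifier.
new_id = list(range(n, n + 20))
new_label = [1, 0] * 10
new_acc = sum(predict_from_id(i) == y for i, y in zip(new_id, new_label)) / len(new_id)

print(f"accuracy on collected data: {train_acc:.2f}")
print(f"accuracy on new records:    {new_acc:.2f}")
```

The rule looks perfect on the collected data yet predicts a single class for every new record, which is the gap between training accuracy and generalization that the check is meant to prevent.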
Usage
Apply identifier column detection when:
- Working with data exported from relational databases where primary key columns may be included.
- Processing CSV or spreadsheet data that may contain row number columns.
- Auditing feature sets to ensure all columns carry genuine predictive signal.
- Building automated data quality pipelines where feature validation is needed upstream of model training.
Theoretical Basis
The detection algorithm is based on the following definition and test:
Definition: A column C in a dataset with n rows is an identifier column if and only if:
- All values in C are integers.
- The set of unique values in C forms a contiguous integer sequence: {c, c+1, c+2, ..., c+n-1} for some integer c.
Detection procedure:
- Extract the unique sorted values from the column: U = sort(unique(C)).
- Compute the expected contiguous range: E = [min(U), min(U)+1, ..., max(U)].
- Verify that |U| = |E| (no gaps or duplicates beyond uniqueness) and U = E element-wise.
- Verify that |U| > 1 to exclude degenerate single-value columns.
This test runs in O(n log n) time per column due to the sorting step and requires no model training, making it an efficient pre-processing check.
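The detection procedure above can be sketched in plain Python. This is a minimal illustration of the steps as described here, not the library's actual implementation; the function name `is_identifier_column` is hypothetical, and treating booleans as non-integers is an assumption of this sketch.

```python
def is_identifier_column(values):
    """Return True if `values` passes the contiguous-integer test sketched above."""
    # All values must be integers (assumption: bool is excluded even though
    # it is a subclass of int in Python).
    if not all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        return False

    # U = sort(unique(C))
    unique_sorted = sorted(set(values))

    # |U| > 1: exclude empty and single-value columns.
    if len(unique_sorted) <= 1:
        return False

    # E = [min(U), min(U)+1, ..., max(U)]
    expected = list(range(unique_sorted[0], unique_sorted[-1] + 1))

    # |U| = |E| and U = E element-wise (list equality checks both).
    return unique_sorted == expected


print(is_identifier_column([3, 1, 2, 4]))    # shuffled but contiguous
print(is_identifier_column([1, 2, 4]))       # gap at 3
print(is_identifier_column([7, 7, 7]))       # single distinct value
print(is_identifier_column([0.5, 1.5, 2.5])) # non-integer values
```

Because the procedure works on unique values, a column with repeated entries such as [10, 11, 11, 12] still passes this sketch. Enforcing the stricter reading of the definition, with n distinct values for n rows, would require an additional check that the number of unique values equals the number of rows.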
Scoring: Because this is a dataset-level issue (not per-row), the summary score is binary:
- 1.0 if no identifier columns are found (dataset is clean with respect to this issue).
- 0.0 if one or more identifier columns are found.
The per-row score is always 1.0 since no individual example is more or less affected by the presence of an identifier column.
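The scoring rule can be sketched as follows. The function name `identifier_column_scores` and its input format (a mapping from column name to a flag saying whether that column was detected as an identifier) are hypothetical and chosen only for illustration.

```python
def identifier_column_scores(column_flags, n_rows):
    """Compute dataset-level and per-row scores from per-column detection flags.

    column_flags: dict mapping column name -> True if flagged as an identifier.
    n_rows: number of examples in the dataset.
    """
    identifier_columns = [name for name, flagged in column_flags.items() if flagged]

    # Binary summary score: 0.0 if any identifier column exists, else 1.0.
    summary_score = 0.0 if identifier_columns else 1.0

    # No individual example is more or less affected, so every row scores 1.0.
    per_row_scores = [1.0] * n_rows

    return summary_score, per_row_scores, identifier_columns


summary, per_row, flagged = identifier_column_scores(
    {"row_id": True, "age": False, "income": False}, n_rows=4
)
print(summary, per_row, flagged)
```

Returning the names of the flagged columns alongside the scores is a design choice of this sketch, since the binary summary alone does not say which columns to drop.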