Heuristic:Evidentlyai Evidently Column Type Inference Rules
| Knowledge Sources | |
|---|---|
| Domains | Data_Quality, Feature_Engineering |
| Last Updated | 2026-02-14 10:00 GMT |
Overview
Cardinality-based heuristics for automatically inferring column types (Numerical, Categorical, Text, Datetime) from pandas DataFrame column properties.
Description
When users do not explicitly specify column types via `ColumnMapping`, Evidently infers types automatically using the column's dtype and cardinality (number of unique values). Two separate inference systems exist: the legacy system (`data_preprocessing.py`) uses a threshold of 5 unique values for integer-to-categorical conversion, while the newer system (`datasets.py`) uses a threshold of 10 for integers and a 50% uniqueness ratio for strings. Understanding these heuristics is critical because incorrect type inference leads to wrong statistical test selection and misleading drift results.
Usage
This heuristic applies whenever you create a `Dataset` or run metrics without explicit column type mapping. Override these defaults with explicit column types (`ColumnMapping` in the legacy API, `DataDefinition` in the new API) when automatic inference produces incorrect results (e.g., ZIP codes detected as numerical, or low-cardinality floats that should be categorical).
The Insight (Rule of Thumb)
New System (datasets.py)
- Float columns: Always Numerical.
- Integer columns, nunique <= 10: Categorical (the `INTEGER_CARDINALITY_LIMIT`).
- Integer columns, nunique > 10: Numerical.
- String columns, nunique > 50% of count: Text (high cardinality strings are free-form text).
- String columns, nunique <= 50% of count: Categorical (low cardinality strings are labels).
- Object columns with string values: Same 50% uniqueness rule as string columns.
- Boolean and category dtype: Always Categorical.
- Datetime dtype: Always Datetime.
- Object columns with list/tuple values: List type.
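The rules above can be sketched as a small pure-Python function. This is an illustration only, not Evidently's implementation: `infer_type_sketch`, `TEXT_UNIQUENESS_RATIO`, and the string labels are hypothetical names, and Python value types stand in for pandas dtypes.

```python
from datetime import datetime

INTEGER_CARDINALITY_LIMIT = 10  # integer threshold quoted for the new system
TEXT_UNIQUENESS_RATIO = 0.5     # hypothetical name for the 50% string rule


def infer_type_sketch(values):
    """Approximate the new-system rules on a plain list of Python values."""
    # Ignore missing values, similar to how nunique() and count() skip NaN.
    present = [v for v in values if v is not None]
    if not present:
        return "Unknown"
    if all(isinstance(v, (list, tuple)) for v in present):
        return "List"
    # bool must be checked before int because bool is a subclass of int.
    if all(isinstance(v, bool) for v in present):
        return "Categorical"
    if all(isinstance(v, datetime) for v in present):
        return "Datetime"
    if all(isinstance(v, float) for v in present):
        return "Numerical"
    if all(isinstance(v, int) for v in present):
        within_limit = len(set(present)) <= INTEGER_CARDINALITY_LIMIT
        return "Categorical" if within_limit else "Numerical"
    if all(isinstance(v, str) for v in present):
        mostly_unique = len(set(present)) > len(present) * TEXT_UNIQUENESS_RATIO
        return "Text" if mostly_unique else "Categorical"
    return "Unknown"


print(infer_type_sketch([1, 2, 3, 1, 2]))                     # Categorical
print(infer_type_sketch(list(range(50))))                     # Numerical
print(infer_type_sketch(["open", "closed", "open", "open"]))  # Categorical
print(infer_type_sketch(["late delivery", "item damaged", "wrong size"]))  # Text
```

Note that the string comparison is strict: a column where exactly half of the values are unique stays Categorical.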
Legacy System (data_preprocessing.py)
- Integer columns, nunique <= 5: Categorical (the `NUMBER_UNIQUE_AS_CATEGORICAL` constant).
- Integer columns, nunique > 5: Numerical.
- Target column special case: If the task is regression, or the target is numeric with more than 5 unique values, treat it as Numerical; otherwise Categorical.
- Prediction column special case: If string, always Categorical. If integer with <= 5 unique values and task is not regression, Categorical.
- Trade-off: Because the thresholds differ, the two systems can classify the same column differently. For example, an integer column with 8 unique values is Numerical in the legacy system but Categorical in the new system.
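The 8-unique-value case can be checked with a minimal sketch that applies both thresholds to the same integer column. `classify_integer_column` is a hypothetical helper written for this comparison; it ignores the legacy target and prediction special cases and is not part of Evidently.

```python
NUMBER_UNIQUE_AS_CATEGORICAL = 5  # legacy threshold
INTEGER_CARDINALITY_LIMIT = 10    # new-system threshold


def classify_integer_column(values, limit):
    """Return 'Categorical' when the number of distinct values is within the limit."""
    return "Categorical" if len(set(values)) <= limit else "Numerical"


# 200 rows with 8 distinct integer values.
ratings = [1, 2, 3, 4, 5, 6, 7, 8] * 25

legacy_result = classify_integer_column(ratings, NUMBER_UNIQUE_AS_CATEGORICAL)
new_result = classify_integer_column(ratings, INTEGER_CARDINALITY_LIMIT)
print(legacy_result, new_result)  # Numerical Categorical
```

Columns with 6 to 10 distinct integer values are the ones most likely to change type when moving between the two systems, so they are good candidates for an explicit type mapping.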
Reasoning
The cardinality threshold distinguishes between truly continuous distributions (many unique values) and discrete categories encoded as numbers. A column with 3-5 unique integer values (e.g., a Likert scale or star rating) behaves more like a categorical feature for drift detection purposes. The 50% uniqueness ratio for strings follows the intuition that labels repeat frequently while free-form text is mostly unique.
The legacy threshold of 5 is conservative (it minimizes false categoricals), while the new threshold of 10 also captures columns such as encoded day-of-week numbers or small enums. The 50% text threshold is a practical balance: text columns like names or descriptions rarely have more than 50% duplicates, while category columns like "status" or "country" have high repetition.
Code Evidence
New inference system from `src/evidently/core/datasets.py:1426-1458`:
INTEGER_CARDINALITY_LIMIT = 10

def infer_column_type(column_data: pd.Series) -> ColumnType:
    if column_data.dtype.name.startswith("float"):
        return ColumnType.Numerical
    if column_data.dtype.name.startswith("int"):
        if column_data.nunique() <= INTEGER_CARDINALITY_LIMIT:
            return ColumnType.Categorical
        else:
            return ColumnType.Numerical
    if column_data.dtype.name in ["string", "str"]:
        if column_data.nunique() > (column_data.count() * 0.5):
            return ColumnType.Text
        else:
            return ColumnType.Categorical
Legacy inference system from `src/evidently/legacy/utils/data_preprocessing.py:556-649`:
NUMBER_UNIQUE_AS_CATEGORICAL = 5

def _get_column_type(column_name, data, mapping=None, cardinality_limit=None):
    # ...
    if pd.api.types.is_integer_dtype(column_dtype):
        nunique = ref_unique or cur_unique
        if nunique is not None and nunique <= NUMBER_UNIQUE_AS_CATEGORICAL:
            return ColumnType.Categorical
Type mismatch warning from `src/evidently/legacy/utils/data_preprocessing.py:590-604`:
if ref_type != cur_type:
    available_set = ["i", "u", "f", "c", "m", "M"]
    if ref_type.kind not in available_set or cur_type.kind not in available_set:
        logging.warning(
            f"Column {column_name} have different types in reference {ref_type} "
            f"and current {cur_type}. Returning type from reference"
        )
        cur_type = ref_type
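The condition in the excerpt above depends on NumPy dtype kind codes (`i` signed integer, `u` unsigned integer, `f` float, `c` complex, `m` timedelta, `M` datetime). A minimal sketch of that check, assuming the two dtypes are already known to differ: `mismatch_triggers_warning` and `NUMERIC_LIKE_KINDS` are hypothetical names, and plain kind strings stand in for real dtype objects.

```python
# Same kind codes as `available_set` in the excerpt.
NUMERIC_LIKE_KINDS = {"i", "u", "f", "c", "m", "M"}


def mismatch_triggers_warning(ref_kind, cur_kind):
    """Return True when either kind falls outside the numeric-like set.

    Assumes the reference and current dtypes have already been found to differ.
    """
    return ref_kind not in NUMERIC_LIKE_KINDS or cur_kind not in NUMERIC_LIKE_KINDS


print(mismatch_triggers_warning("i", "f"))  # False: int vs float
print(mismatch_triggers_warning("O", "i"))  # True: object vs int
```

Under this reading, an int column that becomes float between reference and current data passes silently, while a numeric column that turns into an object column is logged.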