Heuristic:Evidentlyai Evidently Column Type Inference Rules

From Leeroopedia
Knowledge Sources
Domains Data_Quality, Feature_Engineering
Last Updated 2026-02-14 10:00 GMT

Overview

Cardinality-based heuristics for automatically inferring column types (Numerical, Categorical, Text, Datetime) from pandas DataFrame column properties.

Description

When users do not explicitly specify column types via `ColumnMapping`, Evidently infers types automatically using the column's dtype and cardinality (number of unique values). Two separate inference systems exist: the legacy system (`data_preprocessing.py`) uses a threshold of 5 unique values for integer-to-categorical conversion, while the newer system (`datasets.py`) uses a threshold of 10 for integers and a 50% uniqueness ratio for strings. Understanding these heuristics is critical because incorrect type inference leads to wrong statistical test selection and misleading drift results.

Usage

This heuristic applies whenever you create a `Dataset` or run metrics without explicit column type mapping. Override these defaults using `ColumnMapping` when automatic inference produces incorrect results (e.g., ZIP codes detected as numerical, or low-cardinality floats that should be categorical).
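Before relying on automatic inference, it can help to flag the columns most likely to be misclassified. The sketch below is pandas-only and does not call Evidently. `flag_ambiguous_columns` is a hypothetical helper written for illustration, and its two cutoffs are copied from the thresholds described on this page rather than imported from the library.

```python
import pandas as pd

# Thresholds copied from this page (assumption: not imported from Evidently).
LEGACY_LIMIT = 5
NEW_LIMIT = 10


def flag_ambiguous_columns(df: pd.DataFrame) -> dict:
    """Return {column: reason} for columns whose inferred type deserves a manual check.

    Hypothetical helper, for illustration only.
    """
    flagged = {}
    for name in df.columns:
        col = df[name]
        nunique = col.nunique()
        if pd.api.types.is_integer_dtype(col):
            # Integers with 6 to 10 unique values are typed differently by the two systems.
            if LEGACY_LIMIT < nunique <= NEW_LIMIT:
                flagged[name] = "integer with 6-10 unique values: legacy and new systems disagree"
        elif pd.api.types.is_float_dtype(col):
            # Floats are always Numerical, even when they encode a handful of levels.
            if nunique <= NEW_LIMIT:
                flagged[name] = "low-cardinality float: always Numerical, may be categorical"
    return flagged


df = pd.DataFrame(
    {
        "rating": [1, 2, 3, 4, 5, 6, 7, 8] * 2,   # 8 unique integers
        "price_tier": [0.5, 1.0, 1.5, 0.5] * 4,   # 3 unique floats
        "age": list(range(20, 36)),               # 16 unique integers
    }
)
flags = flag_ambiguous_columns(df)
for column, reason in flags.items():
    print(f"{column}: {reason}")
```

Columns reported by a check like this are good candidates for an explicit entry in `ColumnMapping`.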

The Insight (Rule of Thumb)

New System (datasets.py)

  • Float columns: Always Numerical.
  • Integer columns, nunique <= 10: Categorical (the `INTEGER_CARDINALITY_LIMIT`).
  • Integer columns, nunique > 10: Numerical.
  • String columns, nunique > 50% of count: Text (high cardinality strings are free-form text).
  • String columns, nunique <= 50% of count: Categorical (low cardinality strings are labels).
  • Object columns with string values: Same 50% uniqueness rule as string columns.
  • Boolean and category dtype: Always Categorical.
  • Datetime dtype: Always Datetime.
  • Object columns with list/tuple values: List type.
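The rules above can be sketched as a single function. This is an approximation written for illustration, not the library's code: `sketch_infer_type` is a hypothetical name, it returns plain strings instead of `ColumnType` members, and the order of the checks is an assumption.

```python
import pandas as pd

INTEGER_CARDINALITY_LIMIT = 10  # value quoted on this page


def sketch_infer_type(col: pd.Series) -> str:
    """Approximate the new-system rules listed above (illustrative only)."""
    dtype = col.dtype
    if pd.api.types.is_bool_dtype(dtype) or isinstance(dtype, pd.CategoricalDtype):
        return "Categorical"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "Datetime"
    if pd.api.types.is_float_dtype(dtype):
        return "Numerical"
    if pd.api.types.is_integer_dtype(dtype):
        return "Categorical" if col.nunique() <= INTEGER_CARDINALITY_LIMIT else "Numerical"
    if dtype.name == "object":
        # Check for list/tuple values first: lists are unhashable, so nunique() would fail.
        non_null = col.dropna()
        if len(non_null) > 0 and all(isinstance(v, (list, tuple)) for v in non_null):
            return "List"
    if dtype.name in ("object", "string", "str"):
        # 50% uniqueness rule: unique values compared with the non-null count.
        return "Text" if col.nunique() > col.count() * 0.5 else "Categorical"
    return "Unknown"


df = pd.DataFrame(
    {
        "score": [i / 10 for i in range(12)],                # float
        "stars": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2],       # 5 unique integers
        "user_id": list(range(1000, 1012)),                  # 12 unique integers
        "status": ["open", "closed"] * 6,                    # 2 unique strings in 12 rows
        "comment": [f"free-form note {i}" for i in range(12)],  # all unique
        "active": [True, False] * 6,                         # boolean
        "created": pd.date_range("2024-01-01", periods=12),  # datetime
        "tags": [["a", "b"]] * 12,                           # list values
    }
)
inferred = {name: sketch_infer_type(df[name]) for name in df.columns}
for name, kind in inferred.items():
    print(f"{name}: {kind}")
```

Note that `user_id` comes out as Numerical under these rules, which is the kind of result an explicit mapping is meant to correct.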

Legacy System (data_preprocessing.py)

  • Integer columns, nunique <= 5: Categorical (the `NUMBER_UNIQUE_AS_CATEGORICAL` constant).
  • Integer columns, nunique > 5: Numerical.
  • Target column special case: If the task is regression, or the column is numeric with more than 5 unique values, treat as Numerical; otherwise Categorical.
  • Prediction column special case: If string, always Categorical. If integer with <= 5 unique values and task is not regression, Categorical.
  • Trade-off: The two thresholds disagree for integer columns with 6 to 10 unique values. For example, an integer column with 8 unique values would be Numerical in the legacy system but Categorical in the new system.
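To see where the two integer thresholds diverge, the comparison can be tabulated directly. This is a minimal sketch that covers only the plain integer-column rule; it ignores the target and prediction special cases, and `classify_integer` is a hypothetical helper, not a library function.

```python
NUMBER_UNIQUE_AS_CATEGORICAL = 5   # legacy threshold quoted on this page
INTEGER_CARDINALITY_LIMIT = 10     # new threshold quoted on this page


def classify_integer(nunique: int, limit: int) -> str:
    """Apply the 'nunique <= limit means Categorical' rule for an integer column."""
    return "Categorical" if nunique <= limit else "Numerical"


comparison = {}
for n in (3, 5, 6, 8, 10, 11):
    legacy = classify_integer(n, NUMBER_UNIQUE_AS_CATEGORICAL)
    new = classify_integer(n, INTEGER_CARDINALITY_LIMIT)
    comparison[n] = (legacy, new)
    print(f"nunique={n:>2}  legacy={legacy:<11}  new={new}")

# Cardinalities where the two systems return different types.
disagreements = [n for n, (legacy, new) in comparison.items() if legacy != new]
```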

Reasoning

The cardinality threshold distinguishes between truly continuous distributions (many unique values) and discrete categories encoded as numbers. A column with 3-5 unique integer values (e.g., a Likert scale or star rating) behaves more like a categorical feature for drift detection purposes. The 50% uniqueness ratio for strings follows the intuition that labels repeat frequently while free-form text is mostly unique.

The legacy threshold of 5 is conservative (minimizes false categoricals), while the new threshold of 10 catches more edge cases like encoded day-of-week numbers or small enums. The 50% text threshold is a practical balance: text columns like names or descriptions rarely have more than 50% duplicates, while category columns like "status" or "country" have high repetition.
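The 50% ratio is easy to check by hand on a column. The sketch below computes it with pandas; `uniqueness_ratio` is a hypothetical helper, and the sample values are made up for illustration. Both `nunique()` and `count()` ignore missing values, so nulls do not affect the ratio.

```python
import pandas as pd


def uniqueness_ratio(col: pd.Series) -> float:
    """Unique non-null values divided by the non-null count."""
    count = col.count()
    return col.nunique() / count if count else 0.0


# A label column: 3 distinct values across 7 non-null rows.
status = pd.Series(
    ["active", "inactive", "active", None, "active", "pending", "active", "inactive"],
    dtype="string",
)
# A free-form column: 4 distinct values across 5 rows.
notes = pd.Series(
    ["late delivery", "great service", "late delivery", "item damaged", "no issues"],
    dtype="string",
)
# A label column sampled on very few rows: 3 distinct values across 4 rows.
tiny_labels = pd.Series(["a", "b", "c", "a"], dtype="string")

for name, col in [("status", status), ("notes", notes), ("tiny_labels", tiny_labels)]:
    ratio = uniqueness_ratio(col)
    kind = "Text" if ratio > 0.5 else "Categorical"
    print(f"{name}: ratio={ratio:.2f} -> {kind}")
```

The last series shows a limit of the rule: on a very small sample, a genuine label column can exceed the 50% ratio and be typed as Text.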

Code Evidence

New inference system from `src/evidently/core/datasets.py:1426-1458`:

INTEGER_CARDINALITY_LIMIT = 10

def infer_column_type(column_data: pd.Series) -> ColumnType:
    if column_data.dtype.name.startswith("float"):
        return ColumnType.Numerical
    if column_data.dtype.name.startswith("int"):
        if column_data.nunique() <= INTEGER_CARDINALITY_LIMIT:
            return ColumnType.Categorical
        else:
            return ColumnType.Numerical
    if column_data.dtype.name in ["string", "str"]:
        if column_data.nunique() > (column_data.count() * 0.5):
            return ColumnType.Text
        else:
            return ColumnType.Categorical
    # ... (branches for the remaining dtypes are not shown in this excerpt)

Legacy inference system from `src/evidently/legacy/utils/data_preprocessing.py:556-649`:

NUMBER_UNIQUE_AS_CATEGORICAL = 5

def _get_column_type(column_name, data, mapping=None, cardinality_limit=None):
    # ...
    if pd.api.types.is_integer_dtype(column_dtype):
        nunique = ref_unique or cur_unique
        if nunique is not None and nunique <= NUMBER_UNIQUE_AS_CATEGORICAL:
            return ColumnType.Categorical

Type mismatch warning from `src/evidently/legacy/utils/data_preprocessing.py:590-604`:

if ref_type != cur_type:
    available_set = ["i", "u", "f", "c", "m", "M"]
    if ref_type.kind not in available_set or cur_type.kind not in available_set:
        logging.warning(
            f"Column {column_name} have different types in reference {ref_type} "
            f"and current {cur_type}. Returning type from reference"
        )
        cur_type = ref_type
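The single-letter codes in `available_set` are NumPy dtype kinds: `i` signed integer, `u` unsigned integer, `f` float, `c` complex, `m` timedelta, `M` datetime. The sketch below reproduces only the condition from the excerpt to show which reference/current dtype pairs would trigger the warning; `mismatch_triggers_warning` is a hypothetical name, and whatever the library does outside the quoted lines is not modelled.

```python
import numpy as np

# Kinds listed in the excerpt above.
AVAILABLE_KINDS = {"i", "u", "f", "c", "m", "M"}


def mismatch_triggers_warning(ref_dtype, cur_dtype) -> bool:
    """Mirror the excerpt's condition: dtypes differ and at least one kind is outside the set."""
    ref_type, cur_type = np.dtype(ref_dtype), np.dtype(cur_dtype)
    if ref_type == cur_type:
        return False
    return ref_type.kind not in AVAILABLE_KINDS or cur_type.kind not in AVAILABLE_KINDS


pairs = [
    ("int64", "int64"),           # identical dtypes
    ("int64", "float64"),         # both kinds in the set
    ("int64", "object"),          # object kind "O" is outside the set
    ("bool", "int64"),            # bool kind "b" is outside the set
    ("datetime64[ns]", "int64"),  # both kinds in the set
]
results = {pair: mismatch_triggers_warning(*pair) for pair in pairs}
for (ref, cur), warns in results.items():
    print(f"reference={ref:<15} current={cur:<8} warning={warns}")
```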
