Principle:Fastai Fastbook Tabular Preprocessing

From Leeroopedia


Knowledge Sources
Domains Tabular Data, Data Preprocessing, Machine Learning
Last Updated 2026-02-09 17:00 GMT

Overview

Tabular preprocessing is the set of transformations applied to raw DataFrame columns to convert them into a numeric, complete, and properly scaled representation that machine learning algorithms can consume.

Description

Most machine learning algorithms, whether tree-based or gradient-based, require their input data to be fully numeric and free of missing values. Raw tabular data rarely satisfies these constraints: it typically contains string-valued categorical columns, missing entries, and numeric columns at vastly different scales. Tabular preprocessing addresses these three issues through a pipeline of transforms:

  • Categorification: String-valued or low-cardinality columns are mapped to integer codes (0, 1, 2, ...). This is distinct from one-hot encoding; the integer codes are compact and can be fed directly to tree-based models or used as indices into embedding matrices in neural networks. Unknown categories at inference time are assigned a special code.
  • Missing value imputation: Missing entries are replaced with a fill value (commonly the column median) and a companion boolean column is created to indicate which rows had missing data. This preserves the information that a value was missing, which can itself be a useful predictor (or a signal of data leakage).
  • Normalization: Continuous columns are shifted and scaled so that they have approximately zero mean and unit standard deviation. This is critical for neural networks (where gradient magnitudes depend on input scale) but unnecessary for tree-based models.
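The three transforms above can be sketched in plain pandas (illustrative only; fastai's Categorify, FillMissing, and Normalize implement the same ideas with additional train/validation bookkeeping):

```python
import pandas as pd

# Toy frame with a string-valued column, a missing entry, and an unscaled numeric column.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [10.0, None, 30.0, 20.0],
})

# Categorification: map strings to compact integer codes.
# (pandas codes start at 0; fastai shifts by one so 0 can be reserved for unknowns.)
df["color_code"] = df["color"].astype("category").cat.codes

# Missing-value imputation: indicator column first, then median fill.
df["size_na"] = df["size"].isna()
df["size"] = df["size"].fillna(df["size"].median())

# Normalization: shift and scale to roughly zero mean, unit standard deviation.
df["size_norm"] = (df["size"] - df["size"].mean()) / df["size"].std()
```

After this pipeline every column is numeric and complete, which is the form both tree-based models and neural networks expect.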

An additional concern is train/validation splitting. The split must be chosen to reflect the relationship between training data and the data the model will encounter at inference time. For time-series data, this means the validation set should cover a later time period than the training set, not a random subset.

Usage

Apply tabular preprocessing after feature engineering and before model training. The specific transforms depend on the downstream model:

  • For tree-based models (Random Forest, Gradient Boosting): Use Categorify and FillMissing. Normalization is not needed because trees only consider the rank order of values.
  • For neural networks: Use Categorify, FillMissing, and Normalize. Neural networks are sensitive to input scale.
  • For all models: Define train/validation splits that mirror the train/test relationship. For temporal data, use time-based splits.
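In fastai's tabular API, this choice amounts to selecting which processors to pass to `TabularPandas`. A configuration sketch (assumes fastai is installed; names as in the fastai tabular module):

```python
from fastai.tabular.all import Categorify, FillMissing, Normalize

# Tree-based models: scale is irrelevant, so Normalize is omitted.
tree_procs = [Categorify, FillMissing]

# Neural networks: input scale matters, so Normalize is included.
nn_procs = [Categorify, FillMissing, Normalize]
```

The resulting list is passed as the `procs` argument when constructing a `TabularPandas` object.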

Theoretical Basis

Categorification

Given a column with k distinct string values, Categorify assigns each unique value an integer in the range [1, k]. The value 0 is reserved for unknown/unseen categories and for NaN. This mapping is learned on the training set and applied identically to the validation set. Formally:

Let C = {c_1, c_2, ..., c_k} be the set of unique values observed in the training column. The mapping function is:

 f(x) = i  if x = c_i
 f(x) = 0  if x is not in C or x is NaN
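The mapping f can be written directly in a few lines of Python (a minimal sketch of the idea, not fastai's implementation):

```python
def fit_categorify(train_values):
    """Learn value -> integer code on the training column; 0 is reserved."""
    uniques = sorted({v for v in train_values if v is not None})
    return {v: i + 1 for i, v in enumerate(uniques)}

def apply_categorify(mapping, x):
    # Unseen categories and missing values fall back to the reserved code 0.
    return mapping.get(x, 0)

codes = fit_categorify(["red", "blue", "red", "green"])  # blue=1, green=2, red=3
apply_categorify(codes, "red")     # -> 3, a code in [1, k]
apply_categorify(codes, "violet")  # -> 0, unseen at training time
```

Because the mapping is fit once on the training set, the same value always receives the same code at validation and inference time.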

Missing Value Imputation

For a numeric column x with n non-missing values, FillMissing computes the median m of those values. Every missing entry is replaced with m. A new boolean column x_na is created where:

 x_na[i] = True   if x[i] was originally missing
 x_na[i] = False  otherwise

The median is preferred over the mean because it is robust to outliers. The boolean column ensures the model can learn to treat imputed rows differently from genuinely observed rows.
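This procedure maps to two pandas operations; the indicator must be computed before the fill, or the missingness information is lost (illustrative sketch):

```python
import pandas as pd

x = pd.Series([3.0, None, 9.0, 5.0, None])

x_na = x.isna()          # boolean companion column, computed first
m = x.median()           # median of the non-missing values: 5.0
x_filled = x.fillna(m)   # imputed column with no missing entries
```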

Normalization

For a continuous column x, Normalize computes the mean (mu) and standard deviation (sigma) on the training set. Every value is then transformed to:

 x_normalized = (x - mu) / sigma

This ensures that all continuous features are on a comparable scale, which helps neural network optimizers converge faster and avoids numerical issues with large or small values.
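Crucially, mu and sigma are computed on the training split only and then reused unchanged for validation and inference data (a numpy sketch):

```python
import numpy as np

# Statistics come from the training split only.
train = np.array([1.0, 2.0, 3.0, 4.0])
mu, sigma = train.mean(), train.std()

def normalize(x):
    return (x - mu) / sigma

# Validation data is transformed with the *training* mu and sigma,
# so its mean and std need not come out exactly 0 and 1.
valid_normalized = normalize(np.array([5.0, 6.0]))
```

Reusing the training statistics keeps the transform consistent between fitting and serving; recomputing them on validation data would be a subtle form of leakage.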

Time-Based Splitting

For temporal data, a random split would leak future information into the training set. Instead, a cutoff date t is chosen such that:

 Training set:   all rows where date < t
 Validation set: all rows where date >= t

This simulates the real-world scenario where the model is trained on historical data and must predict future outcomes.
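The cutoff rule is a one-line boolean mask in pandas (sketch; the cutoff date here is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2023-01-05", "2023-03-10", "2023-06-01", "2023-09-15"]
    ),
    "y": [1.0, 2.0, 3.0, 4.0],
})

# Everything strictly before the cutoff trains; the rest validates.
cutoff = pd.Timestamp("2023-06-01")
train_df = df[df["date"] < cutoff]
valid_df = df[df["date"] >= cutoff]
```

A random split on the same frame would mix September rows into training while asking the model to "predict" March, which no deployed model ever gets to do.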

Related Pages

Implemented By
