Implementation: Fastai Fastbook TabularPandas
| Knowledge Sources | |
|---|---|
| Domains | Tabular Data, Data Preprocessing |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for tabular data preprocessing and train/validation splitting provided by fastai. It wraps a pandas DataFrame and applies a pipeline of TabularProc transforms (Categorify, FillMissing, Normalize) in place.
Description
TabularPandas is a fastai class that wraps a pandas DataFrame and provides a convenient interface for applying preprocessing transforms, splitting into training and validation sets, and accessing features and targets. It differs from standard scikit-learn transforms in two key ways: (1) transforms mutate the wrapped DataFrame rather than returning a transformed copy (by default the wrapper works on an internal copy of df; pass inplace=True to mutate the original), and (2) transforms run eagerly at construction time rather than lazily at access time.
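The in-place, eager behavior can be illustrated without fastai. The sketch below mimics what FillMissing does by default (median fill plus an indicator column) using plain pandas; the helper name fill_missing_inplace is hypothetical, not a fastai API:

```python
import numpy as np
import pandas as pd

def fill_missing_inplace(df, col):
    """Sketch of FillMissing-style behavior: mutate df directly,
    recording which rows were missing in a new boolean column."""
    df[col + "_na"] = df[col].isna()            # indicator column, like fastai's <col>_na
    df[col] = df[col].fillna(df[col].median())  # median fill, fastai's default strategy
    return df                                   # same object, not a copy

df = pd.DataFrame({"YearMade": [1990.0, np.nan, 2004.0]})
out = fill_missing_inplace(df, "YearMade")
assert out is df  # eager and in-place: no new DataFrame is created
print(df)
```

Scikit-learn's `SimpleImputer`, by contrast, returns a new array and only transforms when you call `transform`.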
The companion function cont_cat_split automatically classifies columns as continuous or categorical based on their cardinality (number of unique values). Columns with at most max_card unique values are treated as categorical; float-typed columns and higher-cardinality columns are treated as continuous.
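The cardinality rule can be reproduced in plain pandas. This is a simplified sketch of the classification logic only (the real cont_cat_split also handles datetime and other dtypes), with the function name split_by_cardinality chosen for illustration:

```python
import pandas as pd
from pandas.api.types import is_float_dtype

def split_by_cardinality(df, max_card=20, dep_var=None):
    """Simplified sketch: float columns and high-cardinality columns
    are continuous; everything else is categorical."""
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the dependent variable is excluded from both lists
        if is_float_dtype(df[col]) or df[col].nunique() > max_card:
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

df = pd.DataFrame({
    "price": [1.5, 2.0, 3.5],         # float -> continuous
    "rooms": [2, 3, 2],               # low-cardinality int -> categorical
    "color": ["red", "blue", "red"],  # string -> categorical
})
cont, cat = split_by_cardinality(df, max_card=20)
print(cont, cat)  # ['price'] ['rooms', 'color']
```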
Usage
Use TabularPandas after feature engineering (e.g., after add_datepart) and before model training. It is the central data object for both tree-based and deep learning tabular workflows in fastai. For neural networks, add Normalize to the processor list. For tree-based models, omit it.
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/09_tabular.md (Lines 346-377)
- Library source: fastai.tabular.core
Signature
```python
# Automatic column classification
cont_cat_split(df, max_card=20, dep_var=None)

# Main preprocessing wrapper
TabularPandas(df, procs=None, cat_names=None, cont_names=None,
              y_names=None, y_block=None, splits=None, do_setup=True,
              inplace=False, reduce_memory=True)
```
Import
```python
from fastai.tabular.all import (TabularPandas, Categorify, FillMissing,
                                Normalize, cont_cat_split)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| df | pandas.DataFrame | Yes | The DataFrame to preprocess. Must already have date features extracted and ordinal categories set. |
| procs | list of TabularProc classes | Yes | Preprocessing transforms to apply. Common choices: [Categorify, FillMissing] for trees, [Categorify, FillMissing, Normalize] for neural nets. |
| cat_names | list of str | Yes | Names of categorical columns. Can be obtained from cont_cat_split. |
| cont_names | list of str | Yes | Names of continuous columns. Can be obtained from cont_cat_split. |
| y_names | str or list of str | Yes | Name(s) of the dependent variable column(s). |
| splits | tuple of (list, list) | No | Tuple of (train_indices, valid_indices). If None, all rows are placed in the training set. |
| max_card (for cont_cat_split) | int | No | Maximum cardinality threshold. Columns with at most this many unique values are treated as categorical. Default 20. |
| dep_var (for cont_cat_split) | str | No | Name of the dependent variable, excluded from the feature classification. |
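The splits argument is just a pair of index lists. A minimal numpy sketch, assuming a toy saleYear column, shows how a boolean condition becomes the (train_indices, valid_indices) tuple:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"saleYear": [2009, 2010, 2011, 2011, 2012]})

cond = df["saleYear"] < 2011                      # training condition
train_idx = np.where(cond)[0]                     # positions where cond is True
valid_idx = np.where(~cond)[0]                    # everything else
splits = (train_idx.tolist(), valid_idx.tolist()) # format TabularPandas expects

print(splits)  # ([0, 1], [2, 3, 4])
```

Using positional indices (rather than label indices) matters: TabularPandas slices rows by position.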
Outputs
| Name | Type | Description |
|---|---|---|
| to.train | TabularPandas subset | Training split with .xs (features DataFrame) and .y (target Series) attributes. |
| to.valid | TabularPandas subset | Validation split with .xs (features DataFrame) and .y (target Series) attributes. |
| to.classes | dict | Maps each categorical column name to its category list (including #na# for missing). |
| to.items | pandas.DataFrame | The fully transformed underlying DataFrame. |
| to.dataloaders(bs) | DataLoaders | Creates PyTorch DataLoaders for neural network training. |
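The category lists in to.classes can be mimicked with pandas categoricals. This sketch (not the fastai implementation) reserves code 0 for missing values, matching the #na# convention Categorify uses:

```python
import pandas as pd

col = pd.Series(["high", "low", None, "high"])
cat = pd.Categorical(col)   # learns the sorted category list
codes = cat.codes + 1       # shift codes so 0 means missing (#na#)
classes = ["#na#"] + list(cat.categories)

print(classes)         # ['#na#', 'high', 'low']
print(codes.tolist())  # [1, 2, 0, 1]
```

The shifted integer codes are what a tree model or an embedding layer actually consumes.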
Usage Examples
Basic Usage
```python
from fastai.tabular.all import *
import pandas as pd
import numpy as np

# Assume df is already loaded and add_datepart has been applied
dep_var = 'SalePrice'

# Define processors for tree-based models
procs = [Categorify, FillMissing]

# Time-based split: train on sales before October 2011, validate on the rest
cond = (df.saleYear < 2011) | (df.saleMonth < 10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))

# Automatically classify columns as continuous or categorical
# (max_card=1 makes every numeric column continuous, so only
#  string columns end up categorical -- a good fit for trees)
cont, cat = cont_cat_split(df, max_card=1, dep_var=dep_var)

# Create the TabularPandas object
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)

# Access training and validation data
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
print(f"Training: {len(xs)} rows, {len(xs.columns)} features")
print(f"Validation: {len(valid_xs)} rows")
```
For Neural Networks (with Normalize)
```python
# Add Normalize for neural network preprocessing
procs_nn = [Categorify, FillMissing, Normalize]
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)

# Create DataLoaders with a large batch size for tabular data
dls = to_nn.dataloaders(1024)
```