
Implementation:Fastai Fastbook TabularPandas

From Leeroopedia


Knowledge Sources
Domains Tabular Data, Data Preprocessing
Last Updated 2026-02-09 17:00 GMT

Overview

A concrete tool from fastai for tabular-data preprocessing and train/validation splitting. It wraps a pandas DataFrame and applies a pipeline of TabularProc transforms (Categorify, FillMissing, Normalize) to it in place.

Description

TabularPandas is a fastai class that wraps a pandas DataFrame and provides a convenient interface for applying preprocessing transforms, splitting into training and validation sets, and accessing features and targets. It differs from standard scikit-learn transforms in two key ways: (1) transforms modify the DataFrame in place rather than returning a new object, and (2) transforms run eagerly at construction time rather than lazily at access time.
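The in-place, eager behavior can be illustrated with a plain-pandas sketch of roughly what FillMissing does (toy data; the real transform also stores the fill value so it can be reapplied to validation and test sets):

```python
import pandas as pd

# Hypothetical toy frame with a missing continuous value.
df = pd.DataFrame({"weight": [1.0, None, 3.0, 5.0]})

# Rough sketch of FillMissing's effect, applied in place: record which
# rows were missing in a boolean weight_na indicator column, then fill
# the gaps with the column median. TabularPandas runs such transforms
# eagerly at construction time, mutating the wrapped DataFrame.
df["weight_na"] = df["weight"].isna()
df["weight"] = df["weight"].fillna(df["weight"].median())

print(df["weight"].tolist())     # [1.0, 3.0, 3.0, 5.0]
print(df["weight_na"].tolist())  # [False, True, False, False]
```

Because the frame is modified in place, the original un-filled values are gone after construction; pass a copy if you need to keep them.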

The companion function cont_cat_split automatically classifies feature columns as continuous or categorical based on their cardinality (number of unique values). String columns are always categorical; floating-point columns are always continuous; integer columns with at most max_card unique values are treated as categorical, and the rest are continuous.
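The classification rule can be sketched in plain pandas (a simplified stand-in for the real cont_cat_split, using a hypothetical toy frame):

```python
import pandas as pd

# Hypothetical toy frame: a string column, a low-cardinality integer
# column, a high-cardinality integer column, and a float target.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue"] * 5,
    "rooms": [1, 2, 3, 2, 1] * 5,          # 3 unique values
    "age":   list(range(25)),               # 25 unique values
    "price": [100.0 + i for i in range(25)],
})

def split_by_cardinality(df, max_card=20, dep_var=None):
    """Simplified sketch of the cont_cat_split rule: float columns are
    continuous, integer columns are continuous only above max_card
    unique values, everything else (e.g. strings) is categorical."""
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue
        s = df[col]
        if pd.api.types.is_float_dtype(s) or (
            pd.api.types.is_integer_dtype(s) and s.nunique() > max_card
        ):
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat

cont, cat = split_by_cardinality(df, max_card=20, dep_var="price")
print(cont)  # ['age']
print(cat)   # ['color', 'rooms']
```

Lowering max_card shifts borderline integer columns like rooms into the continuous list; raising it pushes high-cardinality columns like age into the categorical list.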

Usage

Use TabularPandas after feature engineering (e.g., after add_datepart) and before model training. It is the central data object for both tree-based and deep learning tabular workflows in fastai. For neural networks, add Normalize to the processor list. For tree-based models, omit it.

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/09_tabular.md (Lines 346-377)
  • Library source: fastai.tabular.core

Signature

# Automatic column classification
cont_cat_split(df, max_card=20, dep_var=None)

# Main preprocessing wrapper
TabularPandas(df, procs=None, cat_names=None, cont_names=None,
              y_names=None, y_block=None, splits=None, do_setup=True,
              inplace=False, reduce_memory=True)

Import

from fastai.tabular.all import (TabularPandas, Categorify, FillMissing,
                                 Normalize, cont_cat_split)

I/O Contract

Inputs

  • df (pandas.DataFrame, required): The DataFrame to preprocess. Must already have date features extracted and ordinal categories set.
  • procs (list of TabularProc classes, required): Preprocessing transforms to apply. Common choices: [Categorify, FillMissing] for trees; [Categorify, FillMissing, Normalize] for neural nets.
  • cat_names (list of str, required): Names of categorical columns. Can be obtained from cont_cat_split.
  • cont_names (list of str, required): Names of continuous columns. Can be obtained from cont_cat_split.
  • y_names (str or list of str, required): Name(s) of the dependent-variable column(s).
  • splits (tuple of (list, list), optional): (train_indices, valid_indices). If None, a random split is used.
  • max_card (int, optional; cont_cat_split only): Maximum cardinality threshold; integer columns with at most this many unique values are treated as categorical. Default 20.
  • dep_var (str, optional; cont_cat_split only): Name of the dependent variable to exclude from feature classification.

Outputs

  • to.train (TabularPandas subset): Training split with .xs (features DataFrame) and .y (target Series) attributes.
  • to.valid (TabularPandas subset): Validation split with the same .xs and .y attributes.
  • to.classes (dict): Maps each categorical column name to its category list (including #na# for missing values).
  • to.items (pandas.DataFrame): The fully transformed underlying DataFrame.
  • to.dataloaders(bs) (DataLoaders): Creates PyTorch DataLoaders for neural-network training.
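How the integer codes in to.train.xs relate to the vocabularies in to.classes can be sketched in plain pandas (toy data; a simplified stand-in for what Categorify stores for one column):

```python
import pandas as pd

# Toy categorical column with a missing value.
raw = pd.Series(["low", "high", None, "low"])

# Sketch of the fastai-style vocabulary for this column: '#na#' is
# reserved at index 0 for missing values, then the sorted categories.
cat = pd.Categorical(raw)
classes = ["#na#"] + list(cat.categories)
codes = cat.codes + 1  # shift pandas codes so 0 means missing

print(classes)                      # ['#na#', 'high', 'low']
print([classes[c] for c in codes])  # ['low', 'high', '#na#', 'low']
```

This is the mapping to use when decoding model inputs back to human-readable labels, e.g. when inspecting feature importances or individual predictions.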

Usage Examples

Basic Usage

from fastai.tabular.all import *
import pandas as pd
import numpy as np

# Assume df is already loaded and add_datepart has been applied
dep_var = 'SalePrice'

# Define processors for tree-based models
procs = [Categorify, FillMissing]

# Create time-based split: train on data before Oct 2011, validate on the rest
cond = (df.saleYear < 2011) | ((df.saleYear == 2011) & (df.saleMonth < 10))
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))

# Automatically classify columns; max_card=1 makes effectively every
# numeric column continuous, which suits tree-based models
cont, cat = cont_cat_split(df, max_card=1, dep_var=dep_var)

# Create the TabularPandas object
to = TabularPandas(df, procs, cat, cont, y_names=dep_var, splits=splits)

# Access training and validation data
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

print(f"Training:   {len(xs)} rows, {len(xs.columns)} features")
print(f"Validation: {len(valid_xs)} rows")
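Because the transformed .xs frames are fully numeric, they can be passed straight to scikit-learn estimators. A sketch with synthetic stand-in data (hypothetical values, not the fastbook dataset; assumes scikit-learn is installed):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for to.train.xs / to.train.y (hypothetical data).
rng = np.random.default_rng(0)
xs = pd.DataFrame({
    "f1": rng.normal(size=200),
    "f2": rng.integers(0, 5, size=200).astype(float),
})
y = 2.0 * xs["f1"] + xs["f2"] + rng.normal(scale=0.1, size=200)

# Tree models need no Normalize step; the integer codes produced by
# Categorify are consumed as-is.
model = RandomForestRegressor(n_estimators=40, random_state=0)
model.fit(xs, y)
print(round(model.score(xs, y), 2))
```

In the real workflow you would call model.fit(to.train.xs, to.train.y) and score on to.valid.xs / to.valid.y.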

For Neural Networks (with Normalize)

# Add Normalize for neural network preprocessing
procs_nn = [Categorify, FillMissing, Normalize]
cont_nn, cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)

# Create DataLoaders with large batch size for tabular data
dls = to_nn.dataloaders(1024)
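What Normalize contributes can be sketched in plain pandas (toy data): each continuous column is standardized with statistics computed on the training split only, and those same statistics are reused for the validation split to avoid leakage.

```python
import pandas as pd

# Toy train/valid splits (hypothetical values).
train = pd.DataFrame({"age": [10.0, 20.0, 30.0, 40.0]})
valid = pd.DataFrame({"age": [25.0]})

# Fit statistics on the training split only...
mean, std = train["age"].mean(), train["age"].std()

# ...then standardize both splits with those same statistics.
train["age"] = (train["age"] - mean) / std
valid["age"] = (valid["age"] - mean) / std

print(round(float(train["age"].mean()), 6))   # 0.0
print(round(float(valid["age"].iloc[0]), 6))  # 0.0 (25 is the train mean)
```

TabularPandas handles this bookkeeping automatically once Normalize is in the procs list and splits are supplied, which is why the splits should be defined before construction.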

Related Pages

Implements Principle

Requires Environment
