Workflow:Fastai Fastbook Tabular Modeling

Knowledge Sources	fastai/fastbook, fastai Documentation, scikit-learn Documentation, Entity Embeddings of Categorical Variables
Domains	Tabular_Data, Machine_Learning, Deep_Learning
Last Updated	2026-02-09 17:00 GMT

Overview

End-to-end process for modeling structured tabular data using both tree-based ensemble methods (random forests, gradient boosting) and deep learning with entity embeddings.

Description

This workflow addresses prediction tasks on structured, table-format data (rows and columns). It combines two complementary approaches: decision tree ensembles (random forests and gradient boosting machines) as the primary baseline due to their speed, interpretability, and robustness, and deep learning with categorical entity embeddings for cases with high-cardinality categorical variables or mixed data types. The workflow covers data preprocessing (handling dates, missing values, categorical encoding), model training, feature importance analysis, and model interpretation. It uses scikit-learn for tree-based models, Pandas for data manipulation, and fastai for deep learning tabular models.

Usage

Execute this workflow when working with structured data in table format (CSV, database tables, spreadsheets) where the goal is to predict a target column from other columns. Start with random forests as a fast, robust baseline. Add deep learning if the dataset contains high-cardinality categorical variables (like zip codes or product IDs) or columns with natural language text that benefit from embeddings.

Execution Steps

Step 1: Data Loading and Exploration

Load the tabular dataset using Pandas and perform initial exploration. Examine column types (continuous vs. categorical), check for missing values, understand the target variable distribution, and identify date columns that need feature extraction. Use descriptive statistics and visualizations to understand the data.

Key considerations:

Identify which columns are continuous (numeric) and which are categorical (discrete levels)
Check for and understand patterns of missing data
Examine the target variable for class imbalance or skewed distributions
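
The exploration checks above can be sketched with plain Pandas. This is a minimal illustration, not a prescribed procedure: the inline CSV and the column names (saledate, state, machine_hours, sale_price) are hypothetical stand-ins for a real dataset.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype

# Tiny hypothetical dataset standing in for a real CSV file.
csv_text = """saledate,state,machine_hours,sale_price
2011-03-04,Texas,1200,25000
2011-07-19,Ohio,,31000
2012-01-02,Texas,450,
2012-05-23,,3100,18000
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])
dep_var = "sale_price"  # hypothetical target column

# Column types: numeric columns are candidate continuous features,
# datetime columns need feature extraction, the rest are categorical.
print(df.dtypes)

# Fraction of missing values per column, largest first.
missing = df.isna().mean().sort_values(ascending=False)
print(missing)

# Summary statistics of the target, to spot skew or outliers.
print(df[dep_var].describe())

features = [c for c in df.columns if c != dep_var]
cont_cols = [c for c in features if is_numeric_dtype(df[c])]
date_cols = [c for c in features if is_datetime64_any_dtype(df[c])]
cat_cols = [c for c in features if c not in cont_cols + date_cols]
print(cont_cols, cat_cols, date_cols)
```

A dtype-based split is only a first guess. Numeric identifiers such as zip codes are usually better treated as categorical, so the resulting lists should be reviewed by hand.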

Step 2: Feature Engineering

Transform raw columns into features suitable for modeling. Extract components from date columns (year, month, day, day of week, month-start and month-end flags, etc.) using fastai's add_datepart utility. Holiday flags are not produced by add_datepart and must be added as a domain-specific feature if needed. Convert string columns to categorical type. Create any domain-specific derived features.

Key considerations:

Date decomposition is critical: temporal patterns (weekday vs. weekend, seasonal trends) are often highly predictive
Give ordinal categories (e.g., Small < Medium < Large) an explicit order so that their integer codes follow that order
Keep feature engineering minimal initially; let the model learn relationships
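
To make the date decomposition and ordinal encoding concrete, here is a small Pandas-only sketch. The helper add_date_features is a hypothetical, hand-written stand-in that covers only a few of the columns a utility like add_datepart generates. It is not fastai's implementation, and the column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "saledate": pd.to_datetime(["2011-03-04", "2011-07-31", "2012-01-01"]),
    "product_size": ["Small", "Large", "Medium"],
})

def add_date_features(frame, col):
    """Hypothetical helper: expand one datetime column into simple parts."""
    out = frame.copy()
    d = out[col].dt
    out[col + "_year"] = d.year
    out[col + "_month"] = d.month
    out[col + "_day"] = d.day
    out[col + "_dayofweek"] = d.dayofweek        # Monday=0 ... Sunday=6
    out[col + "_is_month_end"] = d.is_month_end
    out[col + "_is_weekend"] = d.dayofweek >= 5
    return out.drop(columns=col)                 # the raw date is no longer needed

df = add_date_features(df, "saledate")

# Ordinal category: declare the order so integer codes respect it.
sizes = ["Small", "Medium", "Large"]
df["product_size"] = pd.Categorical(df["product_size"], categories=sizes, ordered=True)
df["product_size_code"] = df["product_size"].cat.codes

print(df)
```

Declaring the order matters mostly for tree models, which split on thresholds of the integer code. With an arbitrary (for example alphabetical) coding, a single split cannot separate "Large and above" from the rest.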

Step 3: Data Preprocessing

Handle missing values, normalize continuous variables, and encode categorical variables. For tree-based models, replace missing continuous values with the median and add a binary indicator column. For deep learning, apply fastai's TabularProc processors: Categorify (converts categories to integer codes), FillMissing (handles nulls), and Normalize (standardizes continuous columns).

Key considerations:

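Trees are insensitive to feature scaling and need only simple missing-value handling; deep learning needs explicit normalization and encoding
Fit preprocessing statistics (medians, category levels, normalization means) on the training data only, then apply the same fitted transforms to validation and test data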
Set aside a proper validation set (time-based split for temporal data, random otherwise)
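
The sketch below illustrates the "fit on training data, apply everywhere" rule with a time-based split, median imputation, and a missing-value indicator. It uses plain Pandas rather than fastai's processors, and the data, column names, and the preprocess helper are all hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2010, 2010, 2011, 2011, 2012, 2012],
    "hours": [100.0, np.nan, 300.0, 500.0, np.nan, 900.0],
    "state": ["TX", "OH", "TX", "CA", "OH", "NY"],
})

# Time-based split: earlier years train, the latest year validates.
train = df[df["year"] < 2012].copy()
valid = df[df["year"] >= 2012].copy()

# Statistics are computed from training rows only.
hours_median = train["hours"].median()
state_levels = sorted(train["state"].dropna().unique())

def preprocess(frame):
    """Hypothetical helper applying the training-set statistics to any split."""
    out = frame.copy()
    out["hours_na"] = out["hours"].isna()            # binary missing indicator
    out["hours"] = out["hours"].fillna(hours_median)
    # Levels unseen during training get code -1 (the Pandas convention).
    out["state"] = pd.Categorical(out["state"], categories=state_levels).codes
    return out

train_p = preprocess(train)
valid_p = preprocess(valid)
print(valid_p)
```

Here the validation row with state "NY" receives code -1 because that level never appeared in training. Whatever tool performs the encoding, unseen levels need some explicit fallback like this.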

Step 4: Random Forest Training

Train a random forest model as the initial baseline. A random forest is an ensemble of decision trees, each trained on a random subset of rows (bagging) and considering a random subset of features at each split. The ensemble's prediction is the average of all individual tree predictions. Start with a large number of trees and default hyperparameters.

Key considerations:

Adding more trees does not make a random forest overfit; the averaged prediction stabilizes and extra trees only cost training time
Use out-of-bag (OOB) error as a quick validation metric without needing a separate validation set
Start with n_estimators=100 or more; increasing trees improves accuracy up to a plateau
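
As a minimal sketch of this step, the following trains a scikit-learn random forest on synthetic data, reads the out-of-bag score, and checks that the forest's prediction equals the mean of its trees. The data and the max_features=0.5 setting are illustrative assumptions, not recommended values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data: two informative columns, three noise columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(
    n_estimators=100,   # number of trees
    max_features=0.5,   # fraction of columns considered at each split
    oob_score=True,     # score each row using only trees that did not see it
    random_state=0,
)
model.fit(X, y)

# For regressors, oob_score_ is R^2 computed on out-of-bag predictions.
print(f"OOB R^2: {model.oob_score_:.3f}")

# The ensemble prediction is the average over the individual trees.
tree_preds = np.stack([tree.predict(X[:5]) for tree in model.estimators_])
manual_mean = tree_preds.mean(axis=0)
forest_pred = model.predict(X[:5])
print(np.allclose(manual_mean, forest_pred))
```

The OOB score is convenient for random splits. For temporal data it does not replace a time-based validation set, because out-of-bag rows are still drawn from the same period as the training rows.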

Step 5: Feature Importance Analysis

Compute and analyze feature importance to understand which columns drive predictions. Use the random forest's built-in feature importance (mean decrease in impurity) or permutation importance (shuffle one column and measure accuracy drop). Remove low-importance features to simplify the model and potentially improve generalization.

Key considerations:

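Impurity-based importance is computed on training data and is biased toward high-cardinality features; permutation importance on held-out data avoids both problems, although strongly correlated features can still share or mask each other's importance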
Removing redundant features often maintains or improves accuracy
Use partial dependence plots to visualize the relationship between important features and predictions

Step 6: Deep Learning Tabular Model

Train a fastai tabular deep learning model using entity embeddings for categorical variables. The model creates learned embedding vectors for each level of each categorical variable, concatenates them with normalized continuous variables, and passes the result through fully connected layers. This approach excels with high-cardinality categoricals.

Key considerations:

Specify categorical and continuous column lists explicitly
Entity embeddings learn meaningful representations of categories (e.g., geographic proximity from store IDs)
Use dropout and weight decay for regularization
Compare deep learning results against the random forest baseline
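
To show the architecture described above without depending on a deep learning framework, here is a NumPy sketch of the forward pass only: embedding lookup, concatenation with continuous inputs, and two fully connected layers. The weights are random and untrained. The column names, embedding widths, and layer sizes are hypothetical, and this is not fastai's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical categorical columns: name -> (number of levels, embedding width).
cat_specs = {"store_id": (1000, 16), "day_of_week": (7, 4)}
n_cont = 3  # number of normalized continuous columns

# One embedding matrix per categorical column; each row is one level's vector.
embeddings = {
    name: rng.normal(scale=0.01, size=(n_levels, width))
    for name, (n_levels, width) in cat_specs.items()
}

# Fully connected layers operating on the concatenated input.
n_in = sum(width for _, width in cat_specs.values()) + n_cont
W1, b1 = rng.normal(scale=0.1, size=(n_in, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 1)), np.zeros(1)

def forward(cat_codes, cont):
    """cat_codes: dict of integer code arrays; cont: (batch, n_cont) floats."""
    # Embedding lookup is just row indexing by integer category code.
    looked_up = [embeddings[name][cat_codes[name]] for name in cat_specs]
    x = np.concatenate(looked_up + [cont], axis=1)
    hidden = np.maximum(x @ W1 + b1, 0.0)  # ReLU activation
    return hidden @ W2 + b2                # one regression output per row

batch = {
    "store_id": np.array([3, 999, 42, 3]),
    "day_of_week": np.array([0, 6, 2, 0]),
}
cont = rng.normal(size=(4, n_cont))
preds = forward(batch, cont)
print(preds.shape)
```

In practice the embedding matrices and layer weights are learned jointly by gradient descent, and fastai's tabular_learner builds and trains a model of this general shape from the categorical and continuous column lists.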

Step 7: Model Interpretation and Ensembling

Interpret the final models using tools like partial dependence plots, tree interpreters, and embedding analysis. Optionally ensemble the random forest and deep learning predictions for best results. Analyze residuals to identify systematic prediction errors and guide further feature engineering.

Key considerations:

Embedding weights can be extracted and used for data visualization or clustering
Ensembling tree-based and neural network models often outperforms either alone
Waterfall plots show how individual features contribute to specific predictions
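
A simple form of ensembling is to average the two models' predictions and then inspect the residuals. In the sketch below, scikit-learn's MLPRegressor stands in for the deep learning model purely to keep the example self-contained. The synthetic data and the rmse helper are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic target mixing a linear effect and a step effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(800, 4))
y = 2.0 * X[:, 0] + np.where(X[:, 1] > 0, 3.0, -3.0) + rng.normal(scale=0.2, size=800)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
# Stand-in for the tabular neural network.
nn = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=1000, random_state=0)
nn.fit(X_train, y_train)

def rmse(pred, target):
    """Root mean squared error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rf_pred = rf.predict(X_valid)
nn_pred = nn.predict(X_valid)
ens_pred = (rf_pred + nn_pred) / 2  # unweighted average of the two models

scores = {
    "random_forest": rmse(rf_pred, y_valid),
    "neural_net": rmse(nn_pred, y_valid),
    "ensemble": rmse(ens_pred, y_valid),
}
print(scores)

# Residuals of the ensemble; plot these against features to find systematic errors.
residuals = y_valid - ens_pred
print(residuals.mean(), residuals.std())
```

An unweighted average can never have a higher RMSE than the worse of the two models, but it is not guaranteed to beat the better one, so the comparison on the validation set is what decides whether the ensemble is worth keeping.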
