Principle:Fastai Fastbook Feature Engineering
| Knowledge Sources | |
|---|---|
| Domains | Feature Engineering, Tabular Data, Time Series |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Feature engineering is the process of transforming raw data columns into new, more informative representations that enable machine learning models to learn patterns more effectively.
Description
Raw datasets often contain columns whose values are not directly amenable to the splitting or gradient-based operations used by machine learning algorithms. Feature engineering bridges this gap by creating derived columns that expose latent structure. A particularly important case is date feature extraction: a single date column (e.g., "2011-03-15") encodes many distinct pieces of information, including year, month, day of week, whether it is a month-end, and (when combined with a holiday calendar) whether it falls on a holiday. A decision tree cannot efficiently discover these patterns from a raw date because each of its splits is a binary cut on a single ordinal value. By decomposing the date into its constituent temporal features, we give the model direct access to each dimension of temporal variation.
The general principle extends beyond dates to any column where domain knowledge suggests that derived features would be more informative than the raw value. Examples include:
- Polynomial features: Creating interaction terms or powers of numeric columns.
- Binning: Converting continuous values into categorical bins.
- Text extraction: Pulling structured fields (e.g., domain name from a URL).
- Temporal decomposition: Splitting a date into year, month, week, day, day-of-week, day-of-year, and boolean flags for month-start, month-end, quarter-start, quarter-end, year-start, year-end, plus an elapsed-time numeric value.
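The first three techniques in the list can be sketched with pandas. This is a minimal illustration, not a recipe from the source: the toy data, the column names (`width`, `height`, `age`, `url`), and the bin edges are all assumptions chosen for the example.

```python
import pandas as pd
from urllib.parse import urlparse

# Hypothetical toy data; every column name here is illustrative only.
df = pd.DataFrame({
    "width": [2.0, 3.0, 4.0],
    "height": [1.0, 5.0, 2.0],
    "age": [7, 34, 71],
    "url": [
        "https://example.com/a",
        "https://docs.example.org/b?x=1",
        "http://example.net/",
    ],
})

# Polynomial features: a power of one column and an interaction of two.
df["width_sq"] = df["width"] ** 2
df["area"] = df["width"] * df["height"]

# Binning: map a continuous value onto labelled, ordered intervals
# (0, 18], (18, 65], (65, inf). The edges are arbitrary assumptions.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 65, float("inf")],
    labels=["minor", "adult", "senior"],
)

# Text extraction: keep only the host name of each URL.
df["domain"] = df["url"].map(lambda u: urlparse(u).netloc)

print(df[["width_sq", "area", "age_group", "domain"]])
```

Each derived column is computed from existing columns only, so the same transformation can be reapplied unchanged to validation or test data.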
Usage
Apply feature engineering whenever:
- The dataset contains date or timestamp columns and the model cannot natively reason about temporal patterns (this includes decision trees, random forests, and most neural networks).
- Domain knowledge suggests that a raw column encodes multiple independent signals.
- Initial modeling reveals poor performance that could be addressed by providing the model with richer input features.
- You want to enable a tree-based model to capture cyclical or calendar effects.
Theoretical Basis
Date decomposition is motivated by the structure of decision trees. A decision tree partitions data by repeatedly choosing a feature and a threshold that best separate the target variable. Given a raw date represented as an integer (e.g., days since epoch), each split can only separate "before" from "after" a single date, so a recurring pattern needs its own splits for every occurrence. This makes it inefficient to capture:
- Cyclical patterns: Sales may be higher on weekends regardless of the year. A "day of week" feature directly encodes this.
- Seasonal patterns: Demand may peak in certain months. A "month" feature captures this.
- Trend: Prices may increase year over year. A "year" feature isolates this at calendar-year granularity, separately from the within-year cycles above.
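The weekend case can be checked on synthetic data. The sketch below counts how many thresholds a tree would need on the raw date axis to isolate weekends, and compares that with a single threshold on a day-of-week feature. The eight-week date range and the weekend target are assumptions made up for the illustration; no model is trained.

```python
import pandas as pd

# Eight weeks of daily dates; 2011-01-03 is a Monday.
dates = pd.Series(pd.date_range("2011-01-03", periods=56, freq="D"))

# Synthetic target: 1 on Saturdays and Sundays, 0 otherwise.
is_weekend = dates.dt.day_name().isin(["Saturday", "Sunday"]).astype(int)

# Raw-date view: as a day count, the date is one ordinal axis.
days_since_epoch = (dates - pd.Timestamp("1970-01-01")).dt.days

# Every change of the target along that axis needs its own threshold,
# because one split can only say "before" or "after" a given day.
boundaries_on_raw_date = int((is_weekend.diff().abs() == 1).sum())

# Decomposed view: a single threshold on day-of-week (0=Monday, 6=Sunday)
# separates weekends from weekdays for every week at once.
dayofweek = dates.dt.dayofweek
one_split_is_exact = bool(((dayofweek >= 5).astype(int) == is_weekend).all())

print(boundaries_on_raw_date, one_split_is_exact)  # 15 True
```

The number of raw-date boundaries grows with the length of the history (roughly two per week), while the day-of-week split stays a single threshold no matter how many weeks are covered.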
By decomposing a single date column into N temporal features, we transform a one-dimensional input into an N-dimensional input, allowing the tree to split on each dimension independently. The standard decomposition produces 13 features:
| Feature | Type | Description |
|---|---|---|
| Year | int | Calendar year (e.g., 2011) |
| Month | int | Month of the year (1-12) |
| Week | int | ISO week number (1-53) |
| Day | int | Day of the month (1-31) |
| Dayofweek | int | Day of the week (0=Monday, 6=Sunday) |
| Dayofyear | int | Day of the year (1-366) |
| Is_month_end | bool | Whether the date is the last day of the month |
| Is_month_start | bool | Whether the date is the first day of the month |
| Is_quarter_end | bool | Whether the date is the last day of a quarter |
| Is_quarter_start | bool | Whether the date is the first day of a quarter |
| Is_year_end | bool | Whether the date is the last day of the year |
| Is_year_start | bool | Whether the date is the first day of the year |
| Elapsed | float | Seconds since the Unix epoch, 1970-01-01 00:00:00 UTC (continuous numeric) |
The Elapsed feature is particularly important because it preserves the monotonic ordering of time, enabling the model to capture long-term trends. The boolean flags and cyclical integer features enable the model to capture calendar-based periodic effects.
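A minimal pandas sketch of the 13-feature decomposition in the table follows. The function name `decompose_date`, its `prefix` parameter, and the sample column names `saledate` and `price` are hypothetical, and the sketch assumes timezone-naive dates with no missing values. It is not fastai's own implementation; fastai's `add_datepart` helper produces columns of a similar shape.

```python
import pandas as pd

def decompose_date(df, field, prefix=""):
    """Return a copy of df with `field` replaced by 13 temporal features.

    Hypothetical helper for illustration; assumes no missing dates.
    """
    dates = pd.to_datetime(df[field])
    out = df.drop(columns=[field]).copy()

    # Integer calendar components.
    out[prefix + "Year"] = dates.dt.year
    out[prefix + "Month"] = dates.dt.month
    out[prefix + "Week"] = dates.dt.isocalendar().week.astype("int64")
    out[prefix + "Day"] = dates.dt.day
    out[prefix + "Dayofweek"] = dates.dt.dayofweek  # 0=Monday, 6=Sunday
    out[prefix + "Dayofyear"] = dates.dt.dayofyear

    # Boolean calendar-boundary flags; each name maps to the pandas
    # accessor of the same name in lower case (e.g. dt.is_month_end).
    for flag in ("Is_month_end", "Is_month_start",
                 "Is_quarter_end", "Is_quarter_start",
                 "Is_year_end", "Is_year_start"):
        out[prefix + flag] = getattr(dates.dt, flag.lower())

    # Elapsed: seconds since the Unix epoch, as a float.
    epoch = pd.Timestamp("1970-01-01")
    out[prefix + "Elapsed"] = (dates - epoch).dt.total_seconds()
    return out

# Usage on made-up data.
df = pd.DataFrame({"saledate": ["2011-03-15", "2011-12-31"],
                   "price": [10.0, 12.5]})
features = decompose_date(df, "saledate")
print(features.T)
```

For "2011-03-15" this yields Year 2011, Month 3, Week 11, Day 15, Dayofweek 1 (Tuesday), Dayofyear 74, all six flags False, and Elapsed 1300147200.0. Dropping the original column is a design choice of this sketch: the raw ordering is still available to the model through Elapsed.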