Principle:Fastai Fastbook Random Forest
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Ensemble Methods, Tabular Data |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
A random forest is an ensemble of decision trees, each trained on a random subset of rows and considering a random subset of features at each split, whose predictions are averaged to produce a final estimate that is more accurate and less prone to overfitting than any individual tree.
Description
The random forest algorithm, introduced by Leo Breiman in 2001, builds upon two foundational ideas:
Decision Trees: A decision tree recursively partitions the input space by selecting, at each node, the feature and threshold that best separate the target variable. The tree grows until a stopping criterion is met (e.g., minimum samples per leaf, maximum depth, or maximum number of leaf nodes). While a single deep tree can achieve zero training error, it typically overfits severely -- the fastbook chapter demonstrates a tree with 340,909 leaf nodes for 404,710 training samples, yielding perfect training RMSE of 0.0 but a validation RMSE of 0.338.
Bagging (Bootstrap Aggregating): Breiman's 1994 insight was that training multiple models on random subsets of the data and averaging their predictions reduces variance without increasing bias. The errors of individual models, being uncorrelated, tend to cancel out when averaged.
A random forest combines these ideas and adds column subsampling: at each split point within each tree, only a random fraction of features are considered as candidates. This decorrelates the trees further, since without column subsampling all trees would tend to split on the same dominant features.
The key properties of random forests are:
- Robustness to hyperparameters: Performance is relatively stable across a wide range of parameter choices.
- Built-in validation via OOB error: Each tree is trained on a subset of rows, so the rows not used (out-of-bag rows) provide a free validation estimate.
- No extrapolation: A random forest cannot predict values outside the range of its training data, since predictions are averages of leaf values. This is a critical limitation for time-series data with trends.
- Parallelizable: Trees are independent and can be built on separate CPU cores.
Usage
Use random forests when:
- You need a strong baseline model for tabular data with minimal hyperparameter tuning.
- You want built-in feature importance and OOB error estimates.
- The dataset is large enough to benefit from bagging (thousands of rows or more).
- You need interpretable model diagnostics (feature importance, partial dependence).
Avoid random forests when:
- The prediction task requires extrapolation beyond the training data range.
- The dataset has very high-cardinality categorical features that benefit from learned embeddings.
- The target relationship requires capturing complex interactions that deep learning handles better.
Theoretical Basis
Decision Tree Splitting
At each node, the algorithm evaluates all candidate features and all possible split thresholds. For regression with mean squared error (MSE), the best split minimizes:
MSE_left * N_left + MSE_right * N_right
where MSE_left and MSE_right are the mean squared errors of the left and right child groups, and N_left and N_right are their respective sample counts. The prediction at each leaf is the mean of the target values in that leaf.
Bagging Procedure
Given a training set of N samples:
- For each tree t = 1, ..., T:
- Draw a bootstrap sample of size max_samples from the N training rows (sampling with replacement by default, or without replacement if max_samples < N).
- Grow a decision tree on this sample. At each split, randomly select max_features fraction of the columns as candidates.
- Apply the stopping criterion (e.g., min_samples_leaf).
- For prediction, average the outputs of all T trees.
Out-of-Bag (OOB) Error
For each training row i, define the set of trees that did not include row i in their bootstrap sample. The OOB prediction for row i is the average of predictions from only those trees. The OOB score (R-squared) provides an estimate of generalization performance without needing a separate validation set:
OOB_R2 = 1 - (sum of squared OOB residuals) / (sum of squared deviations from mean)
In the fastbook chapter, the OOB RMSE of 0.211 falls between the training RMSE (0.171) and validation RMSE (0.234), confirming its utility as an intermediate diagnostic.
Convergence with Number of Trees
As T increases, the variance of the ensemble prediction decreases. The fastbook chapter demonstrates that RMSE improvements flatten after approximately 30 trees, showing diminishing returns from additional estimators.