Implementation:Fastai Fastbook RandomForestRegressor
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Ensemble Methods, Tabular Data |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
A concrete tool, provided by scikit-learn, for training random forest regression models. The fastbook Tabular Modeling chapter uses it as the primary tree-based model for the Bulldozers competition.
Description
RandomForestRegressor from scikit-learn implements the random forest algorithm for regression tasks. In the fastbook chapter, it is wrapped in a convenience function rf() that sets the recommended hyperparameters: 40 estimators, at most 200,000 samples per tree, half of the features considered at each split (max_features=0.5), at least 5 samples per leaf, and OOB scoring enabled. The chapter also uses DecisionTreeRegressor as an introductory building block, explaining how an individual tree works before composing many trees into a forest.
The key outputs are the fitted model (for predictions), the oob_prediction_ array (OOB predictions on training data), the oob_score_ R-squared metric, and the feature_importances_ array.
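As a quick illustration of these outputs, the sketch below fits a small forest and inspects each attribute. The synthetic data and the reduced hyperparameter values are assumptions made for the example; they are not the Bulldozers settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption for illustration): only column 0 drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(
    n_estimators=50, max_samples=200, max_features=0.5,
    min_samples_leaf=5, oob_score=True, random_state=0,
).fit(X, y)

preds = m.predict(X)                   # predictions from the fitted model
oob_preds = m.oob_prediction_          # one OOB prediction per training row
oob_r2 = m.oob_score_                  # R-squared computed from the OOB predictions
importances = m.feature_importances_   # one normalized score per column

print(preds.shape, oob_preds.shape, importances.shape)
print(f"OOB R^2: {oob_r2:.3f}")
print(f"Importances sum: {importances.sum():.3f}")
```

Because only the first column carries signal in this toy setup, it should receive by far the largest importance score.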
Usage
Use RandomForestRegressor after preprocessing data with TabularPandas. It accepts the .train.xs features and .train.y target from the TabularPandas object. The fitted model can predict on both training and validation features to assess overfitting via RMSE comparison.
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/09_tabular.md (Lines 449-683)
- Note: RandomForestRegressor and DecisionTreeRegressor are external scikit-learn classes. The fastbook chapter demonstrates their usage and wraps them in helper functions.
Signature
# Individual decision tree (used for visualization and understanding)
DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=25)
# Random forest ensemble
RandomForestRegressor(
    n_jobs=-1,            # Use all CPU cores
    n_estimators=40,      # Number of trees in the forest
    max_samples=200_000,  # Max rows per tree (subsample size)
    max_features=0.5,     # Fraction of columns considered per split
    min_samples_leaf=5,   # Minimum samples required in each leaf
    oob_score=True        # Enable out-of-bag scoring
)
# Fastbook convenience wrapper
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    # Note: **kwargs is accepted but not forwarded to RandomForestRegressor
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True
    ).fit(xs, y)
Import
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| xs | pandas.DataFrame or numpy.ndarray | Yes | Feature matrix from to.train.xs. All values must be numeric (post-Categorify and FillMissing). |
| y | pandas.Series or numpy.ndarray | Yes | Target vector from to.train.y. For the Bulldozers dataset, this is the log of sale price. |
| n_estimators | int | No | Number of trees in the forest. Default 40 in the fastbook wrapper. |
| max_samples | int | No | Maximum number of training samples per tree. Default 200,000. |
| max_features | float | No | Fraction of features to consider at each split. Default 0.5. |
| min_samples_leaf | int | No | Minimum number of samples in each leaf node. Default 5. |
| oob_score | bool | No | Whether to compute the out-of-bag score. The fastbook wrapper fixes this to True (scikit-learn's own default is False). |
| n_jobs | int | No | Number of parallel jobs. -1 uses all available CPUs; the fastbook wrapper fixes this to -1. |
Outputs
| Name | Type | Description |
|---|---|---|
| Fitted model | RandomForestRegressor | The trained model object. Call m.predict(xs) to generate predictions. |
| m.oob_prediction_ | numpy.ndarray | OOB predictions for each training row. Only available when oob_score=True. |
| m.oob_score_ | float | R-squared score computed from OOB predictions. 1.0 = perfect; 0.0 = no better than always predicting the mean of the target; negative values are possible for worse models. |
| m.feature_importances_ | numpy.ndarray | Normalized importance scores for each feature, summing to 1.0. |
| m.estimators_ | list of DecisionTreeRegressor | The individual trees in the forest, accessible for per-tree prediction analysis. |
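The feature_importances_ array is easier to read once each score is paired with its column name and sorted. A minimal sketch, in the spirit of the chapter's feature-importance analysis; the helper name feat_importance, the synthetic DataFrame, and its column names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def feat_importance(m, df):
    """Pair each column name with its importance, most important first."""
    return (pd.DataFrame({"cols": df.columns, "imp": m.feature_importances_})
              .sort_values("imp", ascending=False)
              .reset_index(drop=True))

# Hypothetical data: only the 'signal' column actually drives the target
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(400, 3)),
                  columns=["noise_a", "signal", "noise_b"])
target = 2.0 * df["signal"] + rng.normal(scale=0.1, size=400)

m_demo = RandomForestRegressor(
    n_estimators=30, min_samples_leaf=5, random_state=0
).fit(df, target)

fi = feat_importance(m_demo, df)
print(fi)  # 'signal' should appear in the first row
```

Sorting this way makes it straightforward to spot columns with negligible importance, which are candidates for removal.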
Usage Examples
Basic Usage
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import math
# Assume 'to' is a TabularPandas object
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y
# Step 1: Understand with a simple decision tree
m_tree = DecisionTreeRegressor(max_leaf_nodes=4)
m_tree.fit(xs, y)
# Step 2: Train a full random forest
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True
    ).fit(xs, y)
m = rf(xs, y)
# Evaluation helper functions
def r_mse(pred, y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
# Check training vs validation RMSE
print(f"Train RMSE: {m_rmse(m, xs, y)}") # ~0.171
print(f"Valid RMSE: {m_rmse(m, valid_xs, valid_y)}") # ~0.234
# Check OOB error (intermediate between train and valid)
print(f"OOB RMSE: {r_mse(m.oob_prediction_, y)}") # ~0.211
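The OOB RMSE and the oob_score_ attribute summarize the same OOB predictions on different scales: R-squared equals 1 minus the mean squared error divided by the (population) variance of the target. The sketch below checks that relationship on synthetic data; the data and the forest settings are assumptions for illustration only.

```python
import math
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def r_mse(pred, y):
    return round(math.sqrt(((pred - y) ** 2).mean()), 6)

# Synthetic regression problem (assumption for illustration)
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=600)

m_oob = RandomForestRegressor(
    n_estimators=60, min_samples_leaf=5, oob_score=True, random_state=0
).fit(X, y)

oob_rmse = r_mse(m_oob.oob_prediction_, y)
# R^2 = 1 - MSE / Var(y); np.var uses the population variance (ddof=0)
r2_from_rmse = 1 - oob_rmse ** 2 / y.var()

print(f"OOB RMSE:          {oob_rmse}")
print(f"R^2 from OOB RMSE: {r2_from_rmse:.4f}")
print(f"m.oob_score_:      {m_oob.oob_score_:.4f}")
```

The last two printed values should agree up to rounding, so either metric can be used to track OOB performance without a separate validation set.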
Analyzing Prediction Convergence
import numpy as np
import matplotlib.pyplot as plt
# Get per-tree predictions on validation set: shape (n_trees, n_rows)
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
# Plot RMSE of the running average as more trees are included
n_trees = len(m.estimators_)  # 40 with the rf() defaults
plt.plot(range(1, n_trees + 1),
         [r_mse(preds[:i+1].mean(0), valid_y) for i in range(n_trees)])
plt.xlabel('Number of Trees')
plt.ylabel('Validation RMSE')
plt.title('RMSE convergence with ensemble size')
plt.show()
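The convergence analysis relies on the fact that a forest's prediction is the average of its trees' predictions. The sketch below, on synthetic data (an assumption for illustration), checks this and also computes the standard deviation across trees, which can serve as a rough per-row indicator of how much the trees disagree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption for illustration)
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=300)

m_small = RandomForestRegressor(
    n_estimators=20, min_samples_leaf=5, random_state=0
).fit(X, y)

# One row of predictions per tree: shape (n_trees, n_rows)
tree_preds = np.stack([t.predict(X) for t in m_small.estimators_])
mean_pred = tree_preds.mean(0)   # average over trees
spread = tree_preds.std(0)       # disagreement between trees, per row

print(tree_preds.shape)                             # (20, 300)
print(np.allclose(mean_pred, m_small.predict(X)))   # True
print(f"Mean std across trees: {spread.mean():.4f}")
```

Rows with a large spread are ones where the individual trees disagree, which suggests treating the forest's prediction for those rows with more caution.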