Implementation:Fastai Fastbook RandomForestRegressor

Knowledge Sources	fastbook sklearn RandomForestRegressor
Domains	Machine Learning, Ensemble Methods, Tabular Data
Last Updated	2026-02-09 17:00 GMT

Overview

RandomForestRegressor is scikit-learn's estimator for training random forest regression models. The fastbook Tabular Modeling chapter uses it as the primary tree-based model for the Bulldozers competition.

Description

RandomForestRegressor from scikit-learn implements the random forest algorithm for regression tasks. In the fastbook chapter, it is wrapped in a convenience function rf() that sets recommended hyperparameters: 40 estimators, 200,000 max samples per tree, 0.5 max features fraction, 5 minimum samples per leaf, and OOB scoring enabled. The chapter also uses DecisionTreeRegressor as an introductory building block to explain how individual trees work before composing them into a forest.

The key outputs are the fitted model (for predictions), the oob_prediction_ array (OOB predictions on training data), the oob_score_ R-squared metric, and the feature_importances_ array.
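How these outputs relate to each other can be checked on synthetic data. The sketch below is illustrative only: the generated data, the smaller `max_samples`, and `random_state` are assumptions and do not come from the fastbook chapter. It shows that `oob_score_` is the R-squared of `oob_prediction_` measured against the training target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic regression data (assumption: stands in for to.train.xs / to.train.y)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(
    n_estimators=40, max_samples=200, max_features=0.5,
    min_samples_leaf=5, oob_score=True, random_state=0,
).fit(X, y)

# One OOB prediction per training row
oob_pred = m.oob_prediction_
# Recompute R-squared from the OOB predictions to compare with oob_score_
oob_r2 = r2_score(y, oob_pred)
# Normalized importances, one entry per feature column
importances = m.feature_importances_

print(oob_pred.shape, m.oob_score_, oob_r2, importances.sum())
```

Because each row is scored only by trees that did not see it during training, the OOB figures behave like a validation estimate without needing a separate holdout set.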

Usage

Use RandomForestRegressor after preprocessing data with TabularPandas. It accepts the .train.xs features and .train.y target from the TabularPandas object. The fitted model can predict on both training and validation features to assess overfitting via RMSE comparison.

Code Reference

Source Location

Repository: fastbook
File: translations/cn/09_tabular.md (Lines 449-683)
Note: RandomForestRegressor and DecisionTreeRegressor are external scikit-learn classes. The fastbook chapter demonstrates their usage and wraps them in helper functions.

Signature

# Individual decision tree (used for visualization and understanding)
DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=25)

# Random forest ensemble
RandomForestRegressor(
    n_jobs=-1,           # Use all CPU cores
    n_estimators=40,     # Number of trees in the forest
    max_samples=200_000, # Max rows per tree (subsample size)
    max_features=0.5,    # Fraction of columns considered per split
    min_samples_leaf=5,  # Minimum samples required in each leaf
    oob_score=True       # Enable out-of-bag scoring
)

# Fastbook convenience wrapper
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True
    ).fit(xs, y)
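As listed, the wrapper accepts `**kwargs` but never passes them to the estimator, and its fixed `max_samples=200_000` assumes a training set at least that large (scikit-learn is expected to reject an integer `max_samples` greater than the number of rows). A minimal sketch of a variant for smaller datasets follows. The name `rf_flexible` is hypothetical and is not part of fastbook; the synthetic data is also an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_flexible(xs, y, n_estimators=40, max_samples=200_000,
                max_features=0.5, min_samples_leaf=5, **kwargs):
    """Hypothetical variant of rf(): clamps max_samples and forwards kwargs."""
    # Cap the per-tree subsample at the number of available rows
    max_samples = min(max_samples, len(xs))
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True,
        **kwargs,  # e.g. random_state=42 for reproducible runs
    ).fit(xs, y)

# Small synthetic dataset (assumption, for illustration only)
rng = np.random.default_rng(0)
X_small = rng.normal(size=(300, 3))
y_small = X_small[:, 0] + rng.normal(scale=0.1, size=300)

m_small = rf_flexible(X_small, y_small, random_state=42)
print(m_small.max_samples, m_small.random_state)
```

Clamping keeps the chapter's defaults intact for the Bulldozers data while letting the same call work on a few hundred rows.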

Import

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

I/O Contract

Inputs

Name	Type	Required	Description
xs	pandas.DataFrame or numpy.ndarray	Yes	Feature matrix from `to.train.xs`. All values must be numeric (post-Categorify and FillMissing).
y	pandas.Series or numpy.ndarray	Yes	Target vector from `to.train.y`. For the Bulldozers dataset, this is the log of sale price.
n_estimators	int	No	Number of trees in the forest. Default 40 in the fastbook wrapper.
max_samples	int	No	Maximum number of training rows sampled for each tree. Default 200,000 in the fastbook wrapper; as an integer it must not exceed the number of training rows.
max_features	float	No	Fraction of features to consider at each split. Default 0.5.
min_samples_leaf	int	No	Minimum number of samples in each leaf node. Default 5.
oob_score	bool	No	Whether to compute out-of-bag score. Default True in the fastbook wrapper.
n_jobs	int	No	Number of parallel jobs. -1 uses all available CPUs.

Outputs

Name	Type	Description
Fitted model	RandomForestRegressor	The trained model object. Call `m.predict(xs)` to generate predictions.
m.oob_prediction_	numpy.ndarray	OOB predictions for each training row. Only available when `oob_score=True`.
m.oob_score_	float	R-squared score computed from OOB predictions. 1.0 = perfect; 0.0 = no better than always predicting the mean of the target; negative values are possible for worse models.
m.feature_importances_	numpy.ndarray	Normalized importance scores for each feature, summing to 1.0.
m.estimators_	list of DecisionTreeRegressor	The individual trees in the forest, accessible for per-tree prediction analysis.
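One way to read `feature_importances_` is to pair each score with its column name and sort in descending order. The following is a sketch on synthetic data, not the chapter's own helper; the column names and the data-generating rule are assumptions chosen so that one feature clearly dominates.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first column drives the target (assumption)
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 4))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=600)
names = ["signal", "noise_a", "noise_b", "noise_c"]  # hypothetical column names

m = RandomForestRegressor(
    n_estimators=40, max_features=0.5, min_samples_leaf=5, random_state=0,
).fit(X, y)

# Sort feature indices from most to least important
order = np.argsort(m.feature_importances_)[::-1]
ranked = [(names[i], float(m.feature_importances_[i])) for i in order]

for name, imp in ranked:
    print(f"{name:8s} {imp:.3f}")
```

With a TabularPandas feature matrix, the column names would come from `xs.columns` instead of a hand-written list.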

Usage Examples

Basic Usage

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import math

# Assume 'to' is a TabularPandas object
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

# Step 1: Understand with a simple decision tree
m_tree = DecisionTreeRegressor(max_leaf_nodes=4)
m_tree.fit(xs, y)

# Step 2: Train a full random forest
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True
    ).fit(xs, y)

m = rf(xs, y)

# Evaluation helper functions
def r_mse(pred, y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)

# Check training vs validation RMSE
print(f"Train RMSE: {m_rmse(m, xs, y)}")        # ~0.171
print(f"Valid RMSE: {m_rmse(m, valid_xs, valid_y)}")  # ~0.234

# Check OOB error (intermediate between train and valid)
print(f"OOB RMSE:   {r_mse(m.oob_prediction_, y)}")  # ~0.211

Analyzing Prediction Convergence

import matplotlib.pyplot as plt
import numpy as np

# Get per-tree predictions on the validation set; shape (n_trees, n_rows)
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

# Plot validation RMSE of the averaged prediction as trees are added
# (r_mse is the helper defined under Basic Usage)
n_trees = len(m.estimators_)
plt.plot(range(1, n_trees + 1),
         [r_mse(preds[:i].mean(0), valid_y) for i in range(1, n_trees + 1)])
plt.xlabel('Number of Trees')
plt.ylabel('Validation RMSE')
plt.title('RMSE convergence with ensemble size')
plt.show()
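The convergence plot relies on the forest's prediction being the average of its trees' predictions. The sketch below checks that on synthetic data and also computes the per-row standard deviation across trees, which can be read as a rough indicator of how much the trees disagree. The data, the train/validation split, and `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with a simple holdout split (assumption)
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, y_train, X_valid = X[:300], y[:300], X[300:]

m = RandomForestRegressor(
    n_estimators=40, min_samples_leaf=5, random_state=0,
).fit(X_train, y_train)

# Stack per-tree predictions; shape (n_trees, n_rows)
preds = np.stack([t.predict(X_valid) for t in m.estimators_])
forest_pred = m.predict(X_valid)

# Averaging over the tree axis reproduces the forest's own prediction
matches = np.allclose(preds.mean(0), forest_pred)
# Spread across trees for each validation row
tree_std = preds.std(0)

print(preds.shape, matches, tree_std[:3])
```

Rows with a large `tree_std` are ones where individual trees give very different answers, so their averaged prediction deserves more caution.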
