

Implementation:Fastai Fastbook RandomForestRegressor

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Ensemble Methods, Tabular Data
Last Updated 2026-02-09 17:00 GMT

Overview

Concrete tool, provided by scikit-learn, for training random forest regression models. The fastbook Tabular Modeling chapter uses it as the primary tree-based model for the Bulldozers competition.

Description

RandomForestRegressor from scikit-learn implements the random forest algorithm for regression tasks. In the fastbook chapter, it is wrapped in a convenience function rf() that sets the recommended hyperparameters: 40 estimators, at most 200,000 sampled rows per tree, half of the features considered at each split (max_features=0.5), a minimum of 5 samples per leaf, and OOB scoring enabled. The chapter also uses DecisionTreeRegressor as an introductory building block, explaining how an individual tree works before combining many trees into a forest.
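To make the "single tree first" idea concrete, the following is a minimal sketch that fits a four-leaf DecisionTreeRegressor and prints its splits as text. The synthetic data and the feature names f0, f1, f2 are assumptions made for illustration; they stand in for the Bulldozers features and are not taken from the chapter.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic stand-in data (assumption): 500 rows, 3 numeric features,
# with the target driven mostly by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

# A tree limited to four leaves, small enough to read by eye.
tree = DecisionTreeRegressor(max_leaf_nodes=4, random_state=0).fit(X, y)

# Print the learned splits as indented text; the names are illustrative.
print(export_text(tree, feature_names=["f0", "f1", "f2"]))

n_leaves = tree.get_n_leaves()
```

Each leaf predicts the mean target of the training rows that reach it. A forest averages many such trees, each grown on a different random subset of rows and columns.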

The key outputs are the fitted model (for predictions), the oob_prediction_ array (OOB predictions on training data), the oob_score_ R-squared metric, and the feature_importances_ array.
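The sketch below shows where each of these outputs lives on a fitted model. It uses synthetic data and a smaller max_samples than the chapter does; both are assumptions chosen so the example runs quickly on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Synthetic stand-in data (assumption): 1,000 rows, 4 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

m = RandomForestRegressor(
    n_estimators=40, max_samples=500, max_features=0.5,
    min_samples_leaf=5, oob_score=True, random_state=0,
).fit(X, y)

oob_preds = m.oob_prediction_        # one OOB prediction per training row
oob_r2 = m.oob_score_                # R-squared of those OOB predictions
importances = m.feature_importances_ # one score per column

# The OOB score should agree with R-squared computed by hand.
print(oob_r2, r2_score(y, oob_preds))
```

Because each row's OOB prediction comes only from trees that did not see that row during training, these values behave like a validation estimate that needs no separate hold-out set.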

Usage

Use RandomForestRegressor after preprocessing data with TabularPandas. It accepts the .train.xs features and .train.y target from the TabularPandas object. The fitted model can predict on both training and validation features to assess overfitting via RMSE comparison.

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/09_tabular.md (Lines 449-683)
  • Note: RandomForestRegressor and DecisionTreeRegressor are external scikit-learn classes. The fastbook chapter demonstrates their usage and wraps them in helper functions.

Signature

# Individual decision tree (used for visualization and understanding)
DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=25)

# Random forest ensemble
RandomForestRegressor(
    n_jobs=-1,           # Use all CPU cores
    n_estimators=40,     # Number of trees in the forest
    max_samples=200_000, # Max rows per tree (subsample size)
    max_features=0.5,    # Fraction of columns considered per split
    min_samples_leaf=5,  # Minimum samples required in each leaf
    oob_score=True       # Enable out-of-bag scoring
)

# Fastbook convenience wrapper
# (note: **kwargs is accepted but not forwarded to RandomForestRegressor)
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True
    ).fit(xs, y)
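One practical caveat with the wrapper's defaults: scikit-learn raises a ValueError when an integer max_samples exceeds the number of training rows, so max_samples=200_000 fails on smaller datasets. The sketch below is a hypothetical variant, named rf_capped here for illustration (it is not part of fastbook), that caps max_samples at the dataset size and forwards extra keyword arguments to the model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_capped(xs, y, n_estimators=40, max_samples=200_000,
              max_features=0.5, min_samples_leaf=5, **kwargs):
    """Hypothetical variant of rf(): caps max_samples, forwards kwargs."""
    # Never ask for more rows per tree than the dataset contains.
    max_samples = min(max_samples, len(xs))
    # Do not pass n_jobs or oob_score in kwargs: they are already set
    # below, and a duplicate keyword would raise a TypeError.
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs
    ).fit(xs, y)

# Synthetic stand-in data (assumption): only 300 rows.
rng = np.random.default_rng(0)
xs_small = rng.normal(size=(300, 5))
y_small = xs_small[:, 0] + rng.normal(scale=0.1, size=300)

m_small = rf_capped(xs_small, y_small, random_state=0)
```

With 300 rows, each tree here is trained on a bootstrap sample of 300 rows instead of triggering the size check.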

Import

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

I/O Contract

Inputs

Name | Type | Required | Description
xs | pandas.DataFrame or numpy.ndarray | Yes | Feature matrix from to.train.xs. All values must be numeric (post-Categorify and FillMissing).
y | pandas.Series or numpy.ndarray | Yes | Target vector from to.train.y. For the Bulldozers dataset, this is the log of the sale price.
n_estimators | int | No | Number of trees in the forest. Default 40 in the fastbook wrapper.
max_samples | int | No | Number of rows sampled to train each tree. Default 200,000 in the fastbook wrapper.
max_features | float | No | Fraction of features considered at each split. Default 0.5 in the fastbook wrapper.
min_samples_leaf | int | No | Minimum number of samples in each leaf node. Default 5 in the fastbook wrapper.
oob_score | bool | No | Whether to compute the out-of-bag score. Hard-coded to True in the fastbook wrapper (not a wrapper argument).
n_jobs | int | No | Number of parallel jobs; -1 uses all available CPUs. Hard-coded to -1 in the fastbook wrapper.

Outputs

Name | Type | Description
Fitted model | RandomForestRegressor | The trained model object. Call m.predict(xs) to generate predictions.
m.oob_prediction_ | numpy.ndarray | OOB predictions for each training row. Only available when oob_score=True.
m.oob_score_ | float | R-squared score computed from the OOB predictions. 1.0 is a perfect fit, 0.0 is no better than always predicting the mean of the target, and negative values are worse than that.
m.feature_importances_ | numpy.ndarray | Normalized importance scores for each feature, summing to 1.0.
m.estimators_ | list of DecisionTreeRegressor | The individual trees in the forest, accessible for per-tree prediction analysis.
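The feature_importances_ array is easier to read when paired with column names and sorted. The following is a minimal sketch of that idea; the helper name rank_importances, the column names, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): the target depends strongly on
# one column, weakly on a second, and not at all on a third.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(800, 3)),
                  columns=["strong", "weak", "noise"])
target = (3.0 * df["strong"] + 0.3 * df["weak"]
          + rng.normal(scale=0.1, size=800))

forest = RandomForestRegressor(
    n_estimators=40, min_samples_leaf=5, random_state=0
).fit(df, target)

def rank_importances(model, frame):
    """Hypothetical helper: pair each column with its importance, sorted."""
    return (pd.DataFrame({"feature": frame.columns,
                          "importance": model.feature_importances_})
            .sort_values("importance", ascending=False)
            .reset_index(drop=True))

ranking = rank_importances(forest, df)
print(ranking)
```

Columns near the bottom of such a ranking are candidates for removal, which can simplify the model without much loss in accuracy.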

Usage Examples

Basic Usage

from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
import math

# Assume 'to' is a TabularPandas object
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

# Step 1: Understand with a simple decision tree
m_tree = DecisionTreeRegressor(max_leaf_nodes=4)
m_tree.fit(xs, y)

# Step 2: Train a full random forest
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(
        n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True
    ).fit(xs, y)

m = rf(xs, y)

# Evaluation helper functions
def r_mse(pred, y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)

# Check training vs validation RMSE
print(f"Train RMSE: {m_rmse(m, xs, y)}")        # ~0.171
print(f"Valid RMSE: {m_rmse(m, valid_xs, valid_y)}")  # ~0.234

# Check OOB error (intermediate between train and valid)
print(f"OOB RMSE:   {r_mse(m.oob_prediction_, y)}")  # ~0.211

Analyzing Prediction Convergence

import matplotlib.pyplot as plt
import numpy as np

# Reuses m, valid_xs, valid_y, and r_mse from Basic Usage above

# Get per-tree predictions on the validation set: shape (n_trees, n_rows)
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])

# Plot RMSE as a function of the number of trees averaged
n_trees = len(preds)
plt.plot(range(1, n_trees + 1),
         [r_mse(preds[:i+1].mean(0), valid_y) for i in range(n_trees)])
plt.xlabel('Number of Trees')
plt.ylabel('Validation RMSE')
plt.title('RMSE convergence with ensemble size')
plt.show()
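The convergence plot relies on the fact that a random forest regressor's prediction is the average of its trees' predictions. The sketch below checks that relationship on synthetic data; the data and variable names are assumptions, and NumPy arrays are used throughout so the individual trees receive the same input type they were fitted on.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption): 600 training rows, 200 validation rows.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(600, 4))
y_train = X_train[:, 0] + 0.5 * X_train[:, 2] + rng.normal(scale=0.1, size=600)
X_valid = rng.normal(size=(200, 4))

forest = RandomForestRegressor(n_estimators=40, random_state=0).fit(X_train, y_train)

# One row of predictions per tree: shape (n_trees, n_rows).
tree_preds = np.stack([t.predict(X_valid) for t in forest.estimators_])

# Averaging over the tree axis should reproduce the forest's own prediction.
ensemble_mean = tree_preds.mean(axis=0)
matches = np.allclose(ensemble_mean, forest.predict(X_valid))
print(matches)
```

Averaging only the first k rows of tree_preds therefore gives the prediction of a k-tree forest, which is what each point on the convergence curve represents.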

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
