Principle:Fastai Fastbook Feature Importance

Knowledge Sources	Random Forests (Breiman, 2001); Greedy Function Approximation: A Gradient Boosting Machine (Friedman, 2001); Deep Learning for Coders with fastai and PyTorch
Domains	Model Interpretation, Machine Learning, Feature Selection
Last Updated	2026-02-09 17:00 GMT

Overview

Feature importance analysis is a family of techniques for quantifying how much each input variable contributes to a model's predictions, enabling practitioners to understand model behavior, simplify models by removing irrelevant features, and detect data quality issues such as leakage.

Description

After training a model, knowing which features drive its predictions is often as important as the predictions themselves. Feature importance analysis answers several critical questions identified in the fastbook chapter:

Which columns are the strongest predictors? Global feature importance ranks all features by their contribution to model accuracy across the entire dataset.
Which columns are redundant? Highly correlated features can be identified and removed to simplify the model without sacrificing accuracy.
What is the relationship between a feature and the target? Partial dependence plots isolate the marginal effect of a single feature on the prediction by varying that feature while leaving all other features at their observed values.
What drives an individual prediction? Tree interpretation decomposes a single prediction into per-feature contributions, explaining why the model predicted a specific value for a specific row.
Is there data leakage? Unexpectedly high importance for an identifier or date-derived column may indicate that information from the future or from the target is leaking into the features.

These techniques form a comprehensive interpretability toolkit that applies not just to random forests but, in various forms, to most machine learning models.

Usage

Apply feature importance analysis after training a model and before finalizing the feature set or deploying to production. Specific use cases include:

Feature selection: Remove columns with importance below a threshold (e.g., 0.005) to create a simpler, more maintainable model.
Redundancy removal: Use rank correlation clustering to identify pairs of nearly interchangeable features, then drop one from each pair while monitoring OOB score.
Data leakage detection: Investigate features that have unexpectedly high importance, especially identifiers or metadata columns.
Stakeholder communication: Use partial dependence plots and waterfall charts to explain model behavior to non-technical audiences.

Theoretical Basis

Mean Decrease in Impurity (MDI)

The default feature importance method in scikit-learn random forests is mean decrease in impurity. For each tree and each internal node:

Identify the feature j used for the split.
Compute the weighted impurity decrease: delta = N_node * I_node - N_left * I_left - N_right * I_right, where I is the impurity measure (MSE for regression, Gini for classification) and N is the sample count.
Add delta to the cumulative importance of feature j.

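After processing all trees, the importance scores are normalized so they sum to 1.0 (scikit-learn normalizes each tree's scores, averages them across trees, and renormalizes the average). Features used for many high-improvement splits across many trees receive the highest scores.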
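The accumulation described above can be reproduced by hand from a fitted forest. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the synthetic dataset and the helper name tree_mdi are illustrative assumptions, not part of the fastbook chapter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def tree_mdi(estimator, n_features):
    """Mean decrease in impurity for one fitted scikit-learn tree (hypothetical helper)."""
    t = estimator.tree_
    importances = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, nothing to attribute
            continue
        # delta = N_node * I_node - N_left * I_left - N_right * I_right
        delta = (
            t.weighted_n_node_samples[node] * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        importances[t.feature[node]] += delta
    total = importances.sum()
    return importances / total if total > 0 else importances


# Synthetic stand-in data (assumption): 6 columns, 3 of which carry signal.
X, y = make_regression(n_samples=300, n_features=6, n_informative=3,
                       noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree scores, then renormalize so the result sums to 1.0.
per_tree = np.array([tree_mdi(est, X.shape[1]) for est in forest.estimators_])
manual = per_tree.mean(axis=0)
manual /= manual.sum()

print(np.round(manual, 3))
print(np.round(forest.feature_importances_, 3))
```

On this data the hand-computed vector should agree with forest.feature_importances_ up to floating-point error, which is a convenient check that the formula has been read correctly.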

Removing Low-Importance Features

Given the importance vector, a threshold tau is chosen (e.g., tau = 0.005). Features with importance below tau are removed. The model is retrained on the reduced feature set, and its validation RMSE is compared to that of the original model. In the fastbook chapter, this reduced the feature count from 78 to 21 columns while leaving accuracy about the same.
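The threshold-and-retrain loop might look roughly like the sketch below. It assumes scikit-learn and NumPy; the synthetic data, the hyperparameters, and the helper names fit_forest and rmse are placeholders chosen for illustration rather than the chapter's own code or dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 20 columns, only 4 of which carry signal.
X, y = make_regression(n_samples=600, n_features=20, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)


def fit_forest(X, y):
    """Fit a small forest; hyperparameters are illustrative only."""
    return RandomForestRegressor(n_estimators=40, min_samples_leaf=5,
                                 random_state=0).fit(X, y)


def rmse(model, X, y):
    """Root mean squared error of the model's predictions on (X, y)."""
    return float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))


full = fit_forest(X_train, y_train)

tau = 0.005
# Keep the column indices whose importance is at least tau.
keep = np.flatnonzero(full.feature_importances_ >= tau)

# Retrain on the reduced column set and compare validation error.
reduced = fit_forest(X_train[:, keep], y_train)
rmse_full = rmse(full, X_valid, y_valid)
rmse_reduced = rmse(reduced, X_valid[:, keep], y_valid)

print(f"{X.shape[1]} -> {len(keep)} columns")
print(f"validation RMSE: {rmse_full:.2f} -> {rmse_reduced:.2f}")
```

If the reduced model's RMSE is noticeably worse, tau was probably too aggressive for the dataset and should be lowered.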

Removing Redundant Features

Features may be redundant if they encode similar information. Rank correlation measures this: all values in each column are replaced by their rank (1st, 2nd, ...), and the Spearman correlation is computed between column pairs. Highly correlated pairs (visualized via cluster_columns) can be tested for removal by comparing OOB scores:

Compute baseline OOB R-squared with all features.
For each candidate redundant column, remove it and recompute OOB R-squared.
If the score does not decrease meaningfully, the column can be safely removed.
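The rank-correlation check and the OOB comparison can be sketched together as follows. This assumes pandas, NumPy, and scikit-learn; the toy columns (a, a_dup, b, c) and the helper get_oob are invented for the example, with a_dup constructed as a monotone transform of a so that the two share identical ranks.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "a": rng.normal(size=n),
    "b": rng.normal(size=n),
    "c": rng.normal(size=n),
})
df["a_dup"] = np.exp(df["a"])  # monotone transform: same ranks as "a"
y = 3 * df["a"] + df["b"] ** 2 + rng.normal(scale=0.1, size=n)

# Spearman correlation is Pearson correlation computed on the ranks.
rank_corr = df.corr(method="spearman")
print(rank_corr.round(2))


def get_oob(frame):
    """OOB R-squared of a small forest fit on the given columns (illustrative settings)."""
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              oob_score=True, random_state=0)
    m.fit(frame, y)
    return m.oob_score_


baseline = get_oob(df)
# Drop each candidate in turn and record the resulting OOB score.
drop_scores = {col: get_oob(df.drop(columns=col)) for col in ["a", "a_dup"]}

print(f"baseline OOB R^2: {baseline:.3f}")
for col, score in drop_scores.items():
    print(f"without {col}: {score:.3f}")
```

Candidates are dropped one at a time because removing both members of a correlated pair at once would discard the shared signal rather than the redundancy.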

Partial Dependence

Introduced by Friedman (2001), partial dependence isolates the marginal effect of feature j on the prediction:

Choose a grid of values v_1, v_2, ..., v_G for feature j.
For each grid value v_g, set column j to v_g for every row in the dataset.
Compute the model prediction for each modified row.
Average the predictions across all rows to obtain PD(v_g).

The partial dependence function PD(v) shows how the average prediction changes as feature j varies, with all other features held at their observed values. This differs from simply plotting the target against the feature, which conflates the effect of j with the effects of correlated features.
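The four steps translate almost line for line into code. The sketch below assumes only NumPy; the function name partial_dependence_curve and the hand-written linear predict function are hypothetical stand-ins for a fitted model, chosen so the expected curve is easy to reason about.

```python
import numpy as np


def partial_dependence_curve(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    pd_values = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v  # set column j to v for every row
        pd_values.append(predict(X_mod).mean())  # predict, then average over rows
    return np.array(pd_values)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Make column 1 strongly correlated with column 0.
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)


def predict(X):
    """Stand-in for a fitted model (assumption): a known linear function."""
    return 2.0 * X[:, 0] + 5.0 * X[:, 1] - X[:, 2]


grid = np.linspace(-1.0, 1.0, 5)
pd_curve = partial_dependence_curve(predict, X, feature=0, grid=grid)
slopes = np.diff(pd_curve) / np.diff(grid)
print(slopes)
```

Because the stand-in model is linear, the curve's slope recovers the coefficient of column 0 alone, even though column 1 moves with it in the data; a raw plot of predictions against column 0 would mix in the effect of column 1. For fitted scikit-learn estimators, sklearn.inspection.partial_dependence offers a maintained implementation of the same idea.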

Tree Interpretation (Per-Row Decomposition)

For a single prediction, the tree interpreter traces the path through each tree:

Start at the root node, where the prediction equals the global mean (bias).
At each split node, compute the change in prediction: delta_node = mean(child) - mean(parent).
Attribute delta_node to the feature used at that split.
Sum the contributions over all nodes on the path within each tree, then average the per-tree bias and contributions across all trees in the forest.

The result is a decomposition: prediction = bias + sum(contributions_j for all features j). This provides an additive explanation of why a specific row received its predicted value.
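The decomposition can be checked directly on a scikit-learn forest. The following is a hand-rolled sketch, assuming scikit-learn and NumPy; interpret_row is a hypothetical helper written for this example, and the data is synthetic. Per-tree results are averaged because a random forest's prediction is the mean of its trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor


def interpret_row(forest, x):
    """Split one forest prediction into a bias term plus per-feature contributions."""
    bias = 0.0
    contributions = np.zeros(x.shape[0])
    for est in forest.estimators_:
        t = est.tree_
        node_mean = t.value[:, 0, 0]  # mean target value stored at each node
        # Node ids visited by this row, ordered from root to leaf.
        path = est.decision_path(x.reshape(1, -1)).indices
        bias += node_mean[path[0]]  # the root mean is this tree's bias
        for parent, child in zip(path[:-1], path[1:]):
            # Credit the change in mean to the feature split on at the parent.
            contributions[t.feature[parent]] += node_mean[child] - node_mean[parent]
    n_trees = len(forest.estimators_)
    return bias / n_trees, contributions / n_trees


# Synthetic stand-in data (assumption).
X, y = make_regression(n_samples=300, n_features=5, n_informative=3,
                       noise=5.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

row = X[0]
bias, contributions = interpret_row(forest, row)
prediction = forest.predict(row.reshape(1, -1))[0]

print(f"bias: {bias:.3f}")
print("contributions:", np.round(contributions, 3))
print(f"bias + sum(contributions): {bias + contributions.sum():.3f}")
print(f"forest prediction:         {prediction:.3f}")
```

Within each tree the node-to-node differences telescope from the root mean to the leaf value, which is why the bias and contributions add up to the prediction.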

Related Pages

Implemented By

Implementation:Fastai_Fastbook_Feature_Importances
