Principle:Fastai Fastbook Model Ensembling
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Ensemble Methods, Model Interpretation |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Model ensembling combines the predictions of several independently trained models into a single prediction that is typically more accurate and more robust than that of any individual model. It rests on the principle that uncorrelated errors tend to cancel out when averaged.
Description
The fundamental insight behind ensembling is that different models make different kinds of errors. If those errors are uncorrelated, averaging the predictions reduces the overall error. This principle operates at two levels in the fastbook Tabular Modeling chapter:
Level 1 -- Within-algorithm ensembling: A random forest is itself an ensemble of decision trees. Each tree is trained on a different bootstrap sample with different random feature subsets, so their errors are partially decorrelated. Averaging 40 trees reduces variance compared to any single tree.
Level 2 -- Cross-algorithm ensembling: A random forest and a neural network use fundamentally different learning algorithms (recursive partitioning vs. gradient-based optimization with embeddings). They represent the data differently, handle extrapolation differently, and fail in different ways, so averaging their predictions can produce results better than either model alone. The fastbook chapter demonstrates this with a simple arithmetic average of the random forest and neural network predictions, which yields the best RMSE of the entire chapter.
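Level 1 can be checked directly: a fitted scikit-learn `RandomForestRegressor` exposes its trees in `estimators_`, and averaging their predictions reproduces the forest's own prediction. The following is a minimal sketch on synthetic data (the data, the `rmse` helper, and the variable names are illustrative assumptions, not the chapter's dataset or code).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only, not the chapter's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X[:, 0] * 2.0 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=600)
X_train, y_train = X[:400], y[:400]
X_valid, y_valid = X[400:], y[400:]

# 40 trees, each fit on a bootstrap sample and limited to half the
# features at every split.
rf = RandomForestRegressor(n_estimators=40, max_features=0.5, random_state=0)
rf.fit(X_train, y_train)

# Stack the per-tree predictions: shape (n_trees, n_valid_rows).
per_tree = np.stack([tree.predict(X_valid) for tree in rf.estimators_])

# The forest's prediction is the mean over the tree axis.
forest_from_trees = per_tree.mean(axis=0)

def rmse(pred, target):
    """Root mean squared error between two rank-1 arrays."""
    return float(np.sqrt(((pred - target) ** 2).mean()))

# Compare one tree against the averaged ensemble.
single_tree_rmse = rmse(per_tree[0], y_valid)
forest_rmse = rmse(forest_from_trees, y_valid)
print(single_tree_rmse, forest_rmse)
```

Because squared error is convex, the mean squared error of the averaged prediction can never exceed the average of the individual trees' mean squared errors, which is the formal version of "errors partially cancel".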
Additionally, waterfall charts (drawn with the waterfallcharts package from contributions computed by the treeinterpreter library) provide a visual decomposition of individual predictions into per-feature contributions, complementing ensemble predictions with interpretability. Waterfall charts are a visualization technique rather than an ensembling method, but the chapter presents them alongside ensembling as part of the overall model interpretation and production deployment toolkit.
Usage
Apply model ensembling when:
- You have trained multiple models using different algorithms (e.g., random forest + neural network) and want to improve overall accuracy.
- The individual models have comparable performance levels -- averaging a strong model with a much weaker one provides little benefit and can even reduce accuracy.
- You need a simple, reliable way to boost accuracy without extensive hyperparameter tuning.
- You are in a competition setting (e.g., Kaggle) where small accuracy improvements matter.
Use waterfall charts when:
- You need to explain individual predictions to stakeholders.
- You want to verify that the model is making predictions for sensible reasons.
- You are building a data product where end users need to understand prediction rationale.
Theoretical Basis
Averaging Reduces Variance
Consider M models with predictions p_1, p_2, ..., p_M for a given input. Assume the models' errors all have the same variance, Var(p_i), and are uncorrelated. The variance of the average prediction is then:
Var(average) = Var(p_i) / M
The variance of the ensemble is therefore inversely proportional to M: it keeps falling as models are added, but each additional model contributes a smaller reduction than the one before. Averaging reduces variance only; any bias shared by all models remains. In practice, errors are never perfectly uncorrelated, so the reduction is smaller than the full factor of M, but it is still significant when the models use genuinely different algorithms.
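The effect of correlation can be illustrated with a small simulation. For M models whose errors share variance sigma^2 and pairwise correlation rho, the standard result is Var(average) = rho * sigma^2 + (1 - rho) * sigma^2 / M, which reduces to the formula above when rho = 0. The sketch below uses only the Python standard library; the function names and the way correlated errors are constructed are assumptions made for illustration.

```python
import random
import statistics

def ensemble_error_variance(n_models, rho, n_trials=20000, seed=0):
    """Empirical variance of the averaged error of n_models unit-variance models.

    Each model's error is sqrt(rho) * shared + sqrt(1 - rho) * own, so every
    pair of models has error correlation rho (an illustrative construction).
    """
    rng = random.Random(seed)
    averaged = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # error component common to all models
        errors = [
            rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            for _ in range(n_models)
        ]
        averaged.append(sum(errors) / n_models)
    return statistics.pvariance(averaged)

def predicted_variance(n_models, rho, sigma2=1.0):
    """Closed-form variance for equally correlated, equal-variance errors."""
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

# Compare simulation and formula for uncorrelated and correlated errors.
for rho in (0.0, 0.5):
    for m in (1, 2, 10):
        print(rho, m, round(ensemble_error_variance(m, rho), 3),
              predicted_variance(m, rho))
```

With rho = 0.5 the variance cannot drop below half of the single-model variance no matter how many models are averaged, which is why combining models with genuinely different failure modes matters more than simply adding more similar models.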
Cross-Algorithm Ensembling
A random forest and a neural network have complementary strengths:
| Property | Random Forest | Neural Network |
|---|---|---|
| Error type | Cannot extrapolate; predictions plateau at training range boundaries | Can extrapolate but may overfit to noise in small datasets |
| Feature interaction | Discovers interactions via sequential splits | Learns interactions via matrix multiplication in hidden layers |
| Categorical handling | Binary splits on integer codes | Dense embedding vectors capturing semantic similarity |
| Training variance | Low (many trees average out) | Higher (sensitive to initialization and learning rate) |
Because these models fail in different ways, their errors are only partly correlated, which is why their average can be more accurate than either alone.
Simple Average Ensemble
The simplest ensemble method is an unweighted arithmetic average:
ensemble_prediction = (rf_prediction + nn_prediction) / 2
This requires no additional training or validation. More sophisticated methods (weighted averaging, stacking) can provide further improvement but add complexity. The fastbook chapter uses the simple average and reports that it outperforms both individual models.
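One implementation detail: scikit-learn returns predictions as a rank-1 NumPy array (vector), while the fastai learner returns its PyTorch predictions as a rank-2 tensor (column matrix). Before averaging, the PyTorch predictions must be squeezed to rank 1 and converted to NumPy (here with fastai's to_np helper); otherwise adding an (n, 1) array to an (n,) array would broadcast to an (n, n) matrix instead of averaging element by element. With preds holding the neural network predictions and rf_preds the random forest predictions: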
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
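The shape issue and the benefit of averaging can be reproduced without either model. The sketch below uses NumPy stand-ins for the two prediction arrays (synthetic targets with independent noise, which is an idealized assumption; `nn_preds` and the `r_mse` helper here are illustrative, not the chapter's objects).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
targets = rng.normal(size=n)

# Stand-ins for the two models' outputs (hypothetical data, independent noise).
rf_preds = targets + rng.normal(scale=0.5, size=n)             # shape (n,), like scikit-learn
nn_preds = (targets + rng.normal(scale=0.5, size=n))[:, None]  # shape (n, 1), like a column tensor

# Without squeezing, NumPy broadcasting silently builds an (n, n) matrix.
assert (nn_preds + rf_preds).shape == (n, n)

# Squeeze the column down to rank 1 before averaging.
ens_preds = (nn_preds.squeeze() + rf_preds) / 2

def r_mse(pred, y):
    """Root mean squared error between two rank-1 arrays."""
    return float(np.sqrt(((pred - y) ** 2).mean()))

# Individual errors versus the ensemble error.
print(r_mse(rf_preds, targets), r_mse(nn_preds.squeeze(), targets),
      r_mse(ens_preds, targets))
```

In this idealized setting the two error sources are fully independent, so the ensemble error is noticeably lower than either individual error; with real models the errors are partly correlated and the gain is smaller.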
Waterfall Chart Decomposition
A waterfall chart visualizes the additive decomposition from treeinterpreter:
prediction = bias + contribution_1 + contribution_2 + ... + contribution_n
The chart starts at the bias (global mean), then shows each feature's positive or negative contribution as a bar segment, ending at the final prediction. Small contributions below a threshold (e.g., 0.08) are grouped into an "Others" category for readability.
This decomposition is specific to tree-based models. For neural network predictions, analogous techniques (SHAP, LIME) can provide similar per-feature explanations but are not covered in this chapter.
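The additive decomposition and the grouping of small contributions can be sketched in plain Python. The `waterfall_steps` function, the feature values, and the absolute-value cutoff are hypothetical and chosen for illustration; the plotting library may interpret its threshold differently, and real contributions would come from treeinterpreter rather than a hand-written dictionary.

```python
def waterfall_steps(bias, contributions, threshold):
    """Return (label, running_total) steps for a waterfall chart.

    contributions maps feature name -> signed contribution. Features whose
    absolute contribution is below `threshold` are merged into one "Others"
    step (an assumed absolute cutoff, chosen here for illustration).
    """
    kept = {k: v for k, v in contributions.items() if abs(v) >= threshold}
    others = sum(v for v in contributions.values() if abs(v) < threshold)

    steps = [("bias", bias)]  # the chart starts at the bias
    total = bias
    for name, value in kept.items():
        total += value        # each bar moves the running total up or down
        steps.append((name, total))
    if len(kept) < len(contributions):
        total += others       # all small contributions as a single bar
        steps.append(("Others", total))
    return steps

# Hypothetical numbers, not taken from the chapter.
bias = 10.10
contributions = {
    "YearMade": 0.30,
    "ProductSize": -0.20,
    "Coupler_System": 0.05,
    "saleElapsed": -0.03,
}
steps = waterfall_steps(bias, contributions, threshold=0.08)
prediction = steps[-1][1]  # the final running total is the prediction
for label, running_total in steps:
    print(label, round(running_total, 3))
```

Grouping changes only how the chart is drawn: the final running total still equals the bias plus the sum of all contributions.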