Principle: Scikit-learn Ensemble Training
Overview
A training process that fits multiple base estimators, either in parallel (bagging) or sequentially (boosting), to create a combined model.
Description
Ensemble training is the process by which an ensemble of base estimators is fitted to training data. The two fundamental paradigms are:
- Parallel fitting (bagging): All base estimators are trained independently, each on a different bootstrap sample of the original training data. Because the individual fits do not depend on one another, they can be executed concurrently across multiple CPU cores. Forest-based ensembles such as random forests and extra-trees use this paradigm. The scikit-learn implementation leverages joblib to distribute tree construction across workers, preferring a threading backend because the underlying Cython code releases the Python GIL.
- Sequential fitting (boosting): Each base estimator is trained in sequence, with each new estimator correcting the errors of the combined previous estimators. Gradient boosting and AdaBoost follow this paradigm. Sequential fitting cannot be parallelized across estimators, though individual tree construction may still benefit from parallelism.
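The two paradigms can be contrasted in a minimal sketch (a toy dataset is assumed here for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Parallel fitting: trees are independent, so n_jobs=-1 lets joblib
# build them concurrently across CPU cores.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1,
                                random_state=0).fit(X, y)

# Sequential fitting: each tree corrects the ensemble built so far,
# so the estimators cannot be trained concurrently.
boosted = GradientBoostingClassifier(n_estimators=100,
                                     random_state=0).fit(X, y)

print(forest.score(X, y), boosted.score(X, y))
```

Note that GradientBoostingClassifier accepts no n_jobs parameter at all, reflecting the inherently sequential nature of boosting.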
Bootstrap sampling is central to parallel ensemble training. For each tree, a sample of size n_samples (or max_samples, if specified) is drawn with replacement from the training set. The samples not drawn -- the out-of-bag (OOB) samples -- can be used to estimate generalization error without a separate validation set.
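A minimal sketch of OOB scoring on a forest (the synthetic dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# bootstrap=True (the default) draws a bootstrap sample per tree;
# max_samples=0.8 caps each sample at 80% of the training set;
# oob_score=True evaluates each point only on trees that never saw it.
clf = RandomForestClassifier(n_estimators=200, bootstrap=True,
                             oob_score=True, max_samples=0.8,
                             random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # OOB estimate of generalization accuracy
```

Because the OOB estimate comes for free from the bootstrap draws, no data has to be held out for validation.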
Warm starting allows incremental training: when warm_start=True, the ensemble retains previously fitted estimators and adds new ones. This is useful for incrementally increasing the ensemble size or for monitoring performance as more estimators are added.
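Growing an ensemble incrementally can be sketched as:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

clf = RandomForestClassifier(n_estimators=50, warm_start=True,
                             random_state=0)
clf.fit(X, y)                    # fits 50 trees
clf.set_params(n_estimators=100)
clf.fit(X, y)                    # keeps the 50 fitted trees, adds 50 more
print(len(clf.estimators_))      # 100
```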
Usage
Ensemble training is the core fitting step for all ensemble methods. It is invoked when:
- You call fit on any forest-based ensemble (RandomForest, ExtraTrees).
- You need to control parallelism via n_jobs for faster training.
- You want to use OOB scoring (oob_score=True) as a built-in validation mechanism.
- You want to incrementally grow the ensemble via warm_start=True.
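The usage points above combine naturally: a minimal sketch that grows a forest with warm starting while tracking the OOB score at each size (dataset and size schedule are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, random_state=0)

clf = RandomForestClassifier(warm_start=True, oob_score=True,
                             n_jobs=-1, random_state=0)
for n in (25, 50, 100):
    clf.set_params(n_estimators=n)
    clf.fit(X, y)          # reuses previous trees, fits only the new ones
    print(n, clf.oob_score_)
```

Plotting the OOB score against ensemble size in this way is a common technique for choosing how many trees are enough.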
Theoretical Basis
The theoretical foundations of ensemble training include:
- Bootstrap Aggregating: Drawing samples with replacement creates diverse training sets, leading to diverse models whose averaged predictions have lower variance than any single model.
- Parallel Independence: In bagging, each base learner is independent of all others, enabling embarrassingly parallel computation. The quality of the ensemble depends on the diversity introduced by bootstrap sampling and random feature selection, not on any ordering of the estimators.
- Out-of-Bag Estimation: A bootstrap sample of size n excludes each original point with probability (1 - 1/n)^n, which approaches 1/e, so roughly 37% of the data is left out per tree. These excluded samples provide a free validation set, enabling a nearly unbiased error estimate without separate cross-validation.
- Warm-Start Incrementalism: By retaining existing estimators and adding only new ones, warm starting avoids redundant computation. Before new estimators are fitted, the random state is advanced past the seeds already consumed, so a warm-started ensemble remains reproducible and matches one fitted in a single call.
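The 37% figure above is easy to verify empirically; a short sketch with a plain NumPy bootstrap draw:

```python
import numpy as np

# Empirically check that a bootstrap sample of size n leaves out
# about 1 - 1/e ~ 36.8% of the original points.
rng = np.random.default_rng(0)
n = 100_000
sample = rng.integers(0, n, size=n)        # n indices drawn with replacement
oob_fraction = 1 - np.unique(sample).size / n
print(round(oob_fraction, 3))              # close to 0.368
```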