Principle:Scikit learn Scikit learn Stacking Ensemble
Overview
An ensemble strategy that trains a meta-learner on the cross-validated predictions of multiple base estimators.
Description
Stacking (stacked generalization) is a two-level ensemble architecture. At the first level, a set of diverse base estimators produces a prediction for every training sample via cross-validation, and these out-of-fold predictions are collected to form a new set of meta-features. At the second level, a final estimator (the meta-learner) is trained on these meta-features to learn how best to combine the base learners' outputs. In scikit-learn, the base estimators are additionally refitted on the full training data, and these refitted models are the ones used at prediction time.
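The two-level architecture can be sketched with scikit-learn's StackingClassifier. This is a minimal illustration rather than a recommended configuration: the synthetic dataset, the choice of base estimators, and the hyperparameters are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data, used only for illustration (an assumption of this sketch).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Level 0: diverse base estimators, given as (name, estimator) pairs.
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(random_state=0)),
]

# Level 1: the meta-learner, trained on 5-fold out-of-fold predictions.
stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```

After fitting, the refitted base estimators are available as stack.estimators_ and the trained meta-learner as stack.final_estimator_.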
The critical design choice in stacking is the use of cross-validated predictions for constructing the meta-features. If base learners were simply asked to predict on their own training data, the meta-learner would be trained on overly optimistic predictions, leading to severe overfitting. By using k-fold cross-validation, each training sample's meta-feature is generated by a model that did not see that sample during training, producing honest out-of-fold predictions.
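To make the out-of-fold mechanism concrete, the meta-features can be built by hand with cross_val_predict. The sketch below is not how StackingClassifier must be used; it only mirrors the idea under assumed choices (a synthetic binary dataset, two arbitrary base learners, and the positive-class probability as the single meta-feature per learner).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

base_learners = [
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# For each base learner, every row's prediction comes from a model
# that was fitted on the other four folds, never on that row.
oof_columns = []
for learner in base_learners:
    proba = cross_val_predict(learner, X, y, cv=5, method="predict_proba")
    oof_columns.append(proba[:, 1])  # keep the positive-class probability

# One column per base learner: shape (n_samples, n_base_learners).
meta_features = np.column_stack(oof_columns)

# The meta-learner sees only these honest out-of-fold predictions.
meta_learner = LogisticRegression().fit(meta_features, y)
```

To predict on new data, each base learner would still need to be refitted on the full training set, since cross_val_predict discards its per-fold models.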
The meta-features passed to the final estimator can be class probabilities (via `predict_proba`), decision function scores (via `decision_function`), or raw class predictions (via `predict`). Optionally, the original input features can be passed through alongside the meta-features, giving the final estimator access to both raw inputs and base learner outputs.
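The effect of these options on the meta-feature matrix can be inspected through the stack_method and passthrough parameters of StackingClassifier together with its transform method. The helper build below is hypothetical and exists only for this sketch; the iris dataset and the base estimators are likewise illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features, 3 classes


def build(stack_method, passthrough):
    """Hypothetical helper: same stack, different meta-feature options."""
    return StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(random_state=0)),
            ("knn", KNeighborsClassifier()),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        stack_method=stack_method,
        passthrough=passthrough,
        cv=5,
    )


proba_stack = build("predict_proba", passthrough=False).fit(X, y)
label_stack = build("predict", passthrough=False).fit(X, y)
wide_stack = build("predict_proba", passthrough=True).fit(X, y)

# transform() returns the matrix that is fed to the final estimator.
proba_shape = proba_stack.transform(X).shape  # 2 estimators x 3 class probabilities
label_shape = label_stack.transform(X).shape  # 2 estimators x 1 predicted label
wide_shape = wide_stack.transform(X).shape    # 6 meta-features + 4 original inputs
print(proba_shape, label_shape, wide_shape)
```

Probabilities carry more information per base learner than hard labels, at the cost of a wider meta-feature matrix; passthrough widens it further by the number of original features.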
Usage
Stacking ensembles are appropriate when:
- You have multiple strong but diverse base learners and want to learn from the data how to weight and combine them.
- Simple averaging or majority voting does not capture the complementary patterns in base learner predictions.
- You are willing to accept the additional computational cost of cross-validated meta-feature generation.
- You want a principled method that adapts the combination strategy to the data rather than using fixed weights.
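Whether stacking is worth its extra cost over simple averaging is an empirical question. One way to check, sketched below, is to score a soft-voting ensemble and a stacking ensemble built from the same base learners under the same cross-validation. The dataset, estimators, and fold counts are assumptions, and no particular ranking between the two should be expected in general.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    RandomForestClassifier,
    StackingClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

estimators = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("nb", GaussianNB()),
]

# Fixed combination rule: average the predicted probabilities.
voting = VotingClassifier(estimators=estimators, voting="soft")

# Learned combination rule: a meta-learner fitted on out-of-fold predictions.
stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)

# Outer 3-fold CV; each stacking fit also runs its own inner 5-fold CV,
# which is where the additional computational cost comes from.
voting_scores = cross_val_score(voting, X, y, cv=3)
stacking_scores = cross_val_score(stacking, X, y, cv=3)

print(f"soft voting: {voting_scores.mean():.3f}")
print(f"stacking:    {stacking_scores.mean():.3f}")
```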
Theoretical Basis
The theoretical foundations of stacking include:
- Stacked Generalization (Wolpert, 1992): The original framework proposes using "level-0" generalizers (base learners) whose outputs feed into a "level-1" generalizer (meta-learner). The key insight is that the level-1 generalizer can learn to correct the biases and exploit the complementary strengths of the level-0 generalizers.
- Meta-Learning: The final estimator performs meta-learning -- it learns about the behavior of the base learners rather than learning directly from the raw features. This allows it to discover patterns such as "classifier A is reliable when classifier B is uncertain."
- Cross-Validated Meta-Features to Avoid Overfitting: The use of cross-validation to generate out-of-fold predictions for the meta-features is essential. Without this mechanism, the meta-learner would train on predictions that are unrealistically accurate (since each base learner would be predicting on data it was trained on), leading to overfitting in the stacking layer. Cross-validation ensures the meta-features reflect each base learner's true generalization ability.
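The optimism of in-sample predictions can be observed directly. In the sketch below, an unpruned decision tree (chosen because it can memorize its training data) is scored once on the rows it was fitted on and once on out-of-fold predictions. The dataset and the label-noise level flip_y are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise, used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, flip_y=0.1, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)

# In-sample: the tree predicts the same rows it was trained on.
in_sample_pred = tree.fit(X, y).predict(X)

# Out-of-fold: each row is predicted by a clone trained without that row.
oof_pred = cross_val_predict(tree, X, y, cv=5)

in_sample_acc = accuracy_score(y, in_sample_pred)
oof_acc = accuracy_score(y, oof_pred)

print(f"in-sample accuracy:   {in_sample_acc:.3f}")
print(f"out-of-fold accuracy: {oof_acc:.3f}")
```

A meta-learner trained on the in-sample column would learn to trust this tree far more than its out-of-fold accuracy justifies, which is the failure mode that cross-validated meta-features are meant to prevent.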