Principle:Scikit learn Scikit learn Cross Validated Predictions
Metadata
- Domains: Statistics, Model_Evaluation
- Sources: scikit-learn documentation, "The Elements of Statistical Learning" Hastie et al.
- Last Updated: 2026-02-08 15:00 GMT
Overview
A prediction strategy that generates out-of-fold predictions for every sample: each cross-validation fold's held-out test partition is predicted by a model trained on the remaining folds.
While cross_val_score and cross_validate return aggregate scores per fold, cross-validated predictions return predictions for every sample in the dataset. Each sample's prediction is generated by a model that was not trained on that sample, providing out-of-fold (or out-of-sample) predictions that avoid the optimistic bias inherent in predicting on training data.
Description
How cross_val_predict works differently from cross_val_score:
- cross_val_score fits the model on each training fold, scores it on the corresponding test fold, and returns an array of k scalar scores.
- cross_val_predict fits the model on each training fold, generates predictions on the corresponding test fold, and then assembles all test fold predictions into a single array, in the original sample order, covering every sample in the dataset. This requires a splitting strategy in which every sample falls in exactly one test fold. The result is that each sample has exactly one prediction, generated by a model that never saw that sample during training.
The key distinction is that cross_val_predict returns a prediction vector of the same length as the dataset, not an array of scores. This enables downstream analyses that require per-sample predictions.
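The difference in return shape can be seen in a minimal sketch. The dataset (iris), the estimator (LogisticRegression), and the 5-fold KFold splitter are illustrative assumptions, not choices made by the source:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Illustrative data and model (assumptions, not prescribed by the text)
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One scalar score per fold
scores = cross_val_score(model, X, y, cv=cv)

# One out-of-fold prediction per sample, aligned with the rows of X
y_pred = cross_val_predict(model, X, y, cv=cv)

print(scores.shape)
print(y_pred.shape)
```

With k = 5 folds, scores holds five values, while y_pred has one entry per row of X.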
Out-of-fold predictions:
The term "out-of-fold" emphasizes that each prediction is made by a model trained on all data except the fold containing that sample. This property makes the predictions:
- Free of training leakage at the sample level: No sample's prediction is contaminated by having been seen during training.
- Suitable for computing sample-level diagnostics: Residual plots, confusion matrices, calibration curves, and ROC curves can be constructed from these predictions.
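As a hedged sketch of such diagnostics, the following builds a confusion matrix and ROC curve points from out-of-fold predictions. The breast cancer dataset, the scaled logistic regression pipeline, and the stratified 5-fold splitter are assumptions chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative binary classification setup (assumption)
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold class labels feed the confusion matrix
y_pred = cross_val_predict(model, X, y, cv=cv)
cm = confusion_matrix(y, y_pred)

# Out-of-fold probabilities of the positive class feed the ROC curve
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
fpr, tpr, thresholds = roc_curve(y, proba)

print(cm)
```

The method="predict_proba" argument asks cross_val_predict to call that estimator method on each test fold instead of predict.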
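Important caveat: Passing these predictions into a global evaluation metric (e.g., computing accuracy_score(y_true, y_pred) on the full out-of-fold predictions) may not produce the same result as the mean of per-fold scores from cross_val_score. This is because the global metric pools predictions from k different models, and the metric may not decompose additively over samples or folds. For a sample-averaged metric such as accuracy, the pooled value matches the mean of per-fold scores only when all folds have the same size; metrics such as R^2, F1, or ROC AUC do not decompose over folds, so the two numbers generally differ.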
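The two aggregation routes can be compared side by side. This is a sketch under assumed choices (diabetes dataset, Ridge regression, R^2 as the metric); the point is only that the same out-of-fold predictions yield one number when scored per fold and averaged, and another when scored once over the pooled vector:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict, cross_val_score

# Illustrative regression setup (assumption)
X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Route 1: score each fold separately, then average
fold_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
mean_of_folds = fold_scores.mean()

# Route 2: pool all out-of-fold predictions and score once
y_oof = cross_val_predict(model, X, y, cv=cv)
pooled = r2_score(y, y_oof)

# Scoring the pooled predictions fold by fold recovers the per-fold scores,
# so any gap between the two routes comes from the aggregation alone
per_fold_from_oof = np.array(
    [r2_score(y[test_idx], y_oof[test_idx]) for _, test_idx in cv.split(X)]
)

print(mean_of_folds, pooled)
```

R^2 normalizes by the variance of the targets being scored, which differs between a single fold and the full dataset, so the pooled value is not a simple average of the per-fold values.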
Use cases for cross-validated predictions:
- Stacking (blending): Out-of-fold predictions from a base model serve as features for a meta-learner. This is the standard approach for constructing stacked ensembles without data leakage.
- Probability calibration: Out-of-fold predicted probabilities can be used to fit a calibration model (e.g., Platt scaling, isotonic regression) without optimistic bias.
- Visualization and diagnostics: Plotting predicted vs. actual values, residual distributions, or confusion matrices using out-of-fold predictions gives a more honest picture of model behavior than using in-sample predictions.
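- Error analysis: Identifying which samples are mispredicted when held out. A single cross-validation run predicts each sample only once, so judging whether a sample is consistently mispredicted requires repeating the procedure with different splits.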
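The stacking use case above can be sketched by hand. The dataset, the two base models, the logistic regression meta-learner, and the train/test split are all illustrative assumptions; scikit-learn also ships StackingClassifier, which packages a similar procedure:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    StratifiedKFold,
    cross_val_predict,
    train_test_split,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (assumption)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hypothetical choice of base models for the first level
base_models = [
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Meta-features: out-of-fold positive-class probabilities, one column per base model
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-learner only ever sees predictions made on held-out samples
meta_learner = LogisticRegression().fit(meta_train, y_train)

# For new data, each base model is refit on the full training set
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1] for m in base_models
])

print(meta_learner.score(meta_test, y_test))
```

Had the meta-features been produced by predicting on the same rows the base models were trained on, the meta-learner would learn from overconfident inputs; the out-of-fold construction avoids that.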
Usage
Cross-validated predictions should be used when:
- You need per-sample predictions for diagnostic analysis (residual plots, confusion matrices).
- You are building a stacked ensemble and need unbiased first-level predictions as meta-features.
- You want to calibrate probabilities without introducing optimistic bias from in-sample predictions.
- You want to visualize predicted vs. actual values for the full dataset.
Theoretical Basis
Out-of-fold predictions approximate the predictions a model would make on genuinely unseen data. For each sample i in fold f, the prediction is generated by a model f_hat_{-f} trained on all samples not in fold f. This is analogous to the leave-one-out prediction concept, generalized to k folds.
However, unlike cross-validated scores, which estimate a population-level quantity (expected loss), cross-validated predictions do not have a single clean statistical interpretation as an estimator. Different samples are predicted by different models (each trained on a different (k-1)/k subset of the data), so the combined prediction vector is a composite of k distinct models rather than the output of a single model.
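The composite nature of the prediction vector becomes explicit when the procedure is written out by hand. The sketch below is an illustrative reimplementation under assumed choices (diabetes dataset, Ridge regression, 5-fold KFold), not the library's internal code; it keeps the k fitted models so they can be inspected:

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_predict

# Illustrative regression setup (assumption)
X, y = load_diabetes(return_X_y=True)
estimator = Ridge(alpha=1.0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

oof = np.empty_like(y, dtype=float)
fold_models = []
for train_idx, test_idx in cv.split(X):
    # Fresh copy of the estimator, trained on all samples outside fold f
    model_f = clone(estimator).fit(X[train_idx], y[train_idx])
    # Predictions for fold f are written back at the samples' original positions
    oof[test_idx] = model_f.predict(X[test_idx])
    fold_models.append(model_f)

# Compare against the library routine using the same splitter
reference = cross_val_predict(estimator, X, y, cv=cv)
print(np.allclose(oof, reference))
```

Each entry of fold_models has its own fitted coefficients, which is why the assembled vector cannot be attributed to any one model.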