Principle: Scikit-learn KFold Splitting
Metadata
- Domains: Statistics, Model_Evaluation
- Sources: scikit-learn documentation, "The Elements of Statistical Learning" Hastie et al.
- Last Updated: 2026-02-08 15:00 GMT
Overview
A resampling strategy that partitions data into k equal-sized folds for iterative train-test evaluation.
K-fold cross-validation is one of the most widely used techniques for estimating how well a predictive model will generalize to unseen data. Rather than relying on a single train-test split, which can produce unstable estimates depending on how the data happens to be divided, k-fold cross-validation systematically rotates through multiple partitions so that every observation appears in the test set exactly once and in the training set k - 1 times.
Description
In k-fold cross-validation, the full dataset of n samples is divided into k non-overlapping subsets (folds) of approximately equal size. The procedure then iterates k times. In each iteration, one fold is held out as the test set and the remaining k - 1 folds are combined to form the training set. A model is fit on the training set and evaluated on the held-out fold. After all k iterations, the result is k performance scores, one per fold.
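The procedure above can be sketched directly with scikit-learn's KFold splitter. The synthetic regression dataset and the choice of LinearRegression here are illustrative assumptions, not part of the original text:

```python
# Minimal sketch of the k-fold loop: k iterations, each fitting on k-1 folds
# and scoring on the held-out fold, yielding k scores.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # n = 100 samples, 3 features (synthetic)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in kf.split(X):
    # One iteration: fit on the k-1 training folds, score on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

# The k per-fold scores are typically averaged into a single CV estimate.
cv_estimate = float(np.mean(scores))
```

Note that `shuffle=True` with a fixed `random_state` makes the partition reproducible; without shuffling, KFold takes consecutive blocks of samples.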
Why k=5 and k=10 are common choices:
- k=5 provides a good balance between computational cost and estimation reliability. Each training set uses 80% of the data, which is typically sufficient for stable model fitting.
- k=10 is the classical recommendation from the statistics literature (Kohavi, 1995; Hastie et al., 2009). It uses 90% of the data for training in each fold, producing a slightly less biased estimate of performance at the cost of higher variance across folds and greater computation.
- Smaller values of k (e.g., k=2 or k=3) produce training sets that are substantially smaller than the full dataset, leading to pessimistically biased performance estimates. Larger values (e.g., k=n, which is leave-one-out) minimize bias but can exhibit high variance and are computationally expensive.
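The training-set fraction (k - 1)/k implied by each choice of k can be verified directly. A small illustrative check (the placeholder data is an assumption):

```python
# For each k, every iteration trains on (k - 1)/k of the data.
import numpy as np
from sklearn.model_selection import KFold

n = 100
X = np.zeros((n, 1))  # placeholder features; only the index split matters
fractions = {}
for k in (2, 5, 10):
    train_idx, _ = next(KFold(n_splits=k).split(X))
    fractions[k] = len(train_idx) / n

# fractions == {2: 0.5, 5: 0.8, 10: 0.9}
```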
Variants:
- Standard KFold: Splits data into k consecutive folds without regard to the target variable. Suitable for regression tasks or when class distributions are roughly balanced.
- Stratified KFold: Ensures that each fold preserves the same proportion of samples for each class label as in the full dataset. This is critical for classification tasks with imbalanced classes, where a naive split could produce folds with missing classes.
- Group KFold: Ensures that samples belonging to the same group (e.g., the same patient, the same experiment, the same geographic region) never appear in both the training and test sets of the same fold. This prevents data leakage when observations within a group are correlated.
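The GroupKFold guarantee can be checked mechanically: no group identifier ever appears on both sides of a split. The patient-style grouping below is an illustrative assumption:

```python
# Sketch: samples sharing a group id (e.g. the same patient) never straddle
# the train/test boundary of any fold.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((100, 1))
groups = np.repeat(np.arange(20), 5)  # 20 "patients", 5 samples each

gkf = GroupKFold(n_splits=5)
leaks = sum(
    not set(groups[train]).isdisjoint(groups[test])
    for train, test in gkf.split(X, groups=groups)
)
# leaks == 0: no group is split across train and test in any fold
```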
Usage
K-fold splitting should be used when:
- You need trustworthy performance estimates from limited data and cannot afford to set aside a large hold-out set.
- You want to compare multiple models or tune hyperparameters and need a reliable estimate of generalization error for each configuration.
- Your dataset has class imbalance (use StratifiedKFold) or grouped structure (use GroupKFold) that must be respected during evaluation.
- You are performing model selection and wish to reduce the variance of your performance estimate relative to a single random split.
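For the model-comparison use case above, a common pattern is to pass the same splitter object to cross_val_score for every candidate, so all models are scored on identical folds. The dataset and the two classifiers below are chosen purely for illustration:

```python
# Hedged sketch: comparing two classifiers fold-for-fold with one CV splitter.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Reusing the same splitter keeps the comparison fair: both models see
# exactly the same train/test partitions.
logreg_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
```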
Theoretical Basis
Bias-Variance Tradeoff in Fold Count Selection:
The choice of k involves a tradeoff between bias and variance in the estimated generalization error:
- Bias: With small k, each training set is only a fraction (k-1)/k of the full dataset. Models trained on smaller subsets tend to underperform relative to models trained on the full dataset, so the cross-validation estimate is pessimistically biased. As k increases toward n (leave-one-out), the training set size approaches the full dataset size and bias decreases.
- Variance: With large k, the k training sets overlap substantially, meaning the models trained in different folds are highly correlated. This correlation inflates the variance of the average score. With moderate k (5 or 10), the training sets are more independent, leading to lower variance in the combined estimate.
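The "training sets overlap substantially" claim can be made concrete: two training sets drawn from the same k-fold split share a (k - 2)/(k - 1) fraction of their samples, which grows toward 1 as k increases. A small illustrative check (placeholder data is an assumption):

```python
# Overlap between the first two training sets of a k-fold split.
import numpy as np
from sklearn.model_selection import KFold

n = 100
X = np.zeros((n, 1))
overlaps = {}
for k in (2, 5, 10):
    trains = [set(tr) for tr, _ in KFold(n_splits=k).split(X)]
    overlaps[k] = len(trains[0] & trains[1]) / len(trains[0])

# k=2 -> 0.0, k=5 -> 0.75, k=10 -> ~0.89: larger k means more shared samples,
# hence more correlated fold models.
```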
Stratification and Class Proportions:
In classification settings, stratification ensures that each fold mirrors the overall class distribution. Without stratification, random partitioning can produce folds where minority classes are absent or severely underrepresented, leading to:
- Unreliable per-fold scores that oscillate depending on class presence.
- Biased aggregate estimates that do not reflect the model's true expected performance on the population distribution.
Stratification preserves the marginal distribution P(y) in every fold, providing more stable and representative evaluation.
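The failure mode described above is easy to reproduce: on a sorted, imbalanced target, unshuffled KFold takes consecutive blocks and can miss the minority class entirely, while StratifiedKFold keeps the minority share in every fold. The synthetic labels are an illustrative assumption:

```python
# Plain KFold vs StratifiedKFold on a sorted, 10% minority target.
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)  # labels sorted; class 1 is 10% minority

# Plain KFold on sorted labels: count folds whose test set lacks class 1.
plain_missing = sum(
    (y[test] == 1).sum() == 0 for _, test in KFold(n_splits=5).split(X)
)

# StratifiedKFold: every test fold keeps the 10% minority share.
strat_counts = [
    int((y[test] == 1).sum())
    for _, test in StratifiedKFold(n_splits=5).split(X, y)
]
# plain_missing == 4; strat_counts == [2, 2, 2, 2, 2]
```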