Principle:Scikit learn Scikit learn Score Distribution Analysis
Metadata
- Domains: Statistics, Model_Evaluation
- Sources: scikit-learn documentation, "An Introduction to Statistical Learning" James et al., "All of Statistics" Wasserman
- Last Updated: 2026-02-08 15:00 GMT
Overview
A statistical summary that characterizes the central tendency and variability of cross-validated scores across folds.
After running cross-validation, the raw output is an array of k scores, one per fold. Score distribution analysis transforms this array into meaningful summary statistics -- mean, standard deviation, and confidence intervals -- that communicate both how well the model performs on average and how stable that performance is across different data partitions.
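As a minimal sketch of where the array of fold scores comes from, the snippet below runs 5-fold cross-validation with scikit-learn's `cross_val_score` and reduces the result to a mean and a sample standard deviation. The dataset (`load_iris`) and the estimator (`LogisticRegression`) are illustrative assumptions, not choices made by the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative dataset and model; any estimator and data would do.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cross_val_score returns a NumPy array with one score per fold (k = 5 here).
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

mean_score = scores.mean()
std_score = scores.std(ddof=1)  # ddof=1 gives the sample standard deviation

print(f"fold scores: {scores}")
print(f"accuracy = {mean_score:.3f} +/- {std_score:.3f} (std over {len(scores)} folds)")
```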
Description
Why the mean alone is insufficient:
Reporting only the mean cross-validation score discards critical information about the reliability of the estimate:
- A mean accuracy of 0.85 could come from folds scoring [0.84, 0.85, 0.86, 0.85, 0.85] (very stable) or [0.70, 0.95, 0.80, 0.90, 0.90] (highly variable). The practical implications for deployment differ dramatically.
- Without variability information, it is impossible to determine whether the difference between two models (e.g., 0.85 vs. 0.83) is statistically meaningful or within the noise of the evaluation procedure.
- Stakeholders need to understand the worst-case fold performance, not just the average, to assess deployment risk.
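The two sets of fold scores quoted above can be compared directly. The sketch below uses only Python's standard `statistics` module; the helper name `summarize` is hypothetical.

```python
import statistics

stable = [0.84, 0.85, 0.86, 0.85, 0.85]
variable = [0.70, 0.95, 0.80, 0.90, 0.90]

def summarize(scores):
    """Return the mean, sample standard deviation, and worst fold score."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # divides by k - 1
        "min": min(scores),
    }

stable_summary = summarize(stable)
variable_summary = summarize(variable)

# Both sets share the same mean but differ in spread and worst-case fold.
for name, summary in [("stable", stable_summary), ("variable", variable_summary)]:
    print(f"{name}: mean={summary['mean']:.3f} "
          f"std={summary['std']:.3f} min={summary['min']:.2f}")
```

Both sets average 0.85, but the second has a sample standard deviation of about 0.10 and a worst fold of 0.70, against roughly 0.007 and 0.84 for the first.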
Standard deviation of fold scores:
The standard deviation across folds quantifies the spread of performance estimates. A small standard deviation indicates that the model performs consistently regardless of which data partition is used for testing. A large standard deviation may signal:
- Data heterogeneity: Different regions of the feature space have different prediction difficulty.
- Small dataset effects: With limited data, each fold's composition can vary substantially.
- Model instability: The model is sensitive to the specific samples in the training set.
Confidence intervals:
A confidence interval provides a range within which the true generalization performance is likely to fall. Under the assumption that fold scores are approximately normally distributed, an approximate 95% confidence interval for the mean score is:
mean +/- 1.96 * (std / sqrt(k))
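where k is the number of folds. For a given standard deviation the interval narrows as k grows, and for a given k it narrows as the standard deviation falls, giving a quantitative measure of estimation precision. With the small k typical in practice, the critical value of Student's t distribution with k-1 degrees of freedom (about 2.78 for k = 5) is a more conservative choice than 1.96.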
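The interval formula can be sketched with the standard library alone. The function name `normal_confidence_interval` is hypothetical, and the fold scores are the variable example used earlier in this page. The critical value is taken from `statistics.NormalDist` (Python 3.8+) rather than hard-coding 1.96.

```python
import math
import statistics

def normal_confidence_interval(scores, confidence=0.95):
    """Approximate CI for the mean fold score: mean +/- z * std / sqrt(k)."""
    k = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample standard deviation
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * std / math.sqrt(k)
    return mean - half_width, mean + half_width

fold_scores = [0.70, 0.95, 0.80, 0.90, 0.90]
low, high = normal_confidence_interval(fold_scores)
print(f"95% CI for the mean score: [{low:.3f}, {high:.3f}]")
```

This follows the normal approximation exactly as stated, so it inherits the same assumptions about the fold scores.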
Usage
Score distribution analysis should be applied whenever cross-validation results are reported or used for decision-making:
- Model comparison: When choosing between candidate models, compare both mean scores and their confidence intervals. If the intervals overlap substantially, the models may not be meaningfully different.
- Reporting results: Always report mean and standard deviation (e.g., "accuracy = 0.85 +/- 0.03") rather than the mean alone, and state whether the +/- term is a standard deviation or a confidence-interval half-width.
- Detecting instability: A large standard deviation relative to the mean may indicate that the model or evaluation setup needs further investigation.
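The three uses above can be combined into one small reporting routine. This is a sketch under stated assumptions: the fold scores for `model_a` and `model_b` are made-up illustrative values, and the helpers `summarize` and `intervals_overlap` are hypothetical names. Interval overlap is a rough screen, not a formal significance test.

```python
import math
import statistics

def summarize(scores, z=1.96):
    """Mean, sample std, relative std, and approximate 95% CI of fold scores."""
    k = len(scores)
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    half_width = z * std / math.sqrt(k)
    return {
        "mean": mean,
        "std": std,
        "relative_std": std / mean,  # spread relative to the mean
        "ci": (mean - half_width, mean + half_width),
    }

def intervals_overlap(ci_a, ci_b):
    """True if two closed intervals (low, high) share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical fold scores for two candidate models (illustrative only).
model_a = [0.84, 0.86, 0.85, 0.87, 0.83]
model_b = [0.82, 0.85, 0.81, 0.84, 0.83]

summary_a = summarize(model_a)
summary_b = summarize(model_b)

for name, s in [("A", summary_a), ("B", summary_b)]:
    print(f"model {name}: accuracy = {s['mean']:.3f} +/- {s['std']:.3f} (std), "
          f"95% CI [{s['ci'][0]:.3f}, {s['ci'][1]:.3f}]")

overlap = intervals_overlap(summary_a["ci"], summary_b["ci"])
print("intervals overlap:", overlap)
```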
Theoretical Basis
Sample mean:
The cross-validation mean score is the arithmetic average of the k fold scores:
mean = (1/k) * sum_{i=1}^{k} s_i
This is an unbiased estimator of the expected fold score, but a slightly biased estimator of the generalization performance of a model trained on the full dataset, because each fold's model is trained on only (k-1)/k of the data. The bias is usually pessimistic.
Sample standard deviation:
The sample standard deviation of fold scores is:
std = sqrt( (1/(k-1)) * sum_{i=1}^{k} (s_i - mean)^2 )
This measures the typical deviation of a single fold's score from the mean. Note that with small k (e.g., 5 or 10), the estimate of standard deviation itself has high uncertainty.
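The two estimators above translate directly into code. The sketch below implements them by hand (the function names `sample_mean` and `sample_std` are hypothetical) and checks the result against `statistics.stdev`, which also divides by k-1. NumPy's `std` defaults to `ddof=0` (division by k), so `ddof=1` has to be passed to match the formula above.

```python
import math
import statistics

def sample_mean(scores):
    """mean = (1/k) * sum of fold scores."""
    return sum(scores) / len(scores)

def sample_std(scores):
    """std = sqrt( (1/(k-1)) * sum of squared deviations from the mean )."""
    k = len(scores)
    m = sample_mean(scores)
    return math.sqrt(sum((s - m) ** 2 for s in scores) / (k - 1))

fold_scores = [0.70, 0.95, 0.80, 0.90, 0.90]

manual_std = sample_std(fold_scores)
library_std = statistics.stdev(fold_scores)        # divides by k - 1
population_std = statistics.pstdev(fold_scores)    # divides by k

print(f"mean            = {sample_mean(fold_scores):.3f}")
print(f"sample std      = {manual_std:.4f}")
print(f"population std  = {population_std:.4f}")
```

With only five folds the gap between the two divisors is noticeable, which is one reason to state which version is being reported.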
Central limit theorem for fold scores:
When k is moderate to large and the fold scores are approximately independent and identically distributed, the central limit theorem implies that the sample mean is approximately normally distributed. This justifies the use of normal-distribution-based confidence intervals. However, in practice, fold scores are not fully independent because the training sets overlap, so the naive interval can understate the true uncertainty and should be interpreted as an approximation rather than an exact probability statement.