Principle: DistrictDataLabs Yellowbrick Learning Curve Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Model_Selection, Hyperparameter_Tuning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Learning curve analysis is a diagnostic technique that evaluates how a model's training and cross-validation performance change as the number of training samples increases, revealing whether the model would benefit from more data or from increased complexity.
Description
A learning curve plots model performance (on the y-axis) against the number of training examples used to fit the model (on the x-axis). Two curves are generated: one for the training score and one for the cross-validated test score. By observing how these curves converge or diverge as more training data is added, practitioners can diagnose whether a model is suffering from high bias (underfitting) or high variance (overfitting).
The training process is repeated for a sequence of increasing training set sizes. At each size, the model is trained on the given subset and evaluated via cross-validation. The mean score across folds is plotted along with a shaded band representing one standard deviation, which indicates the variability of the estimate. This gives an empirical picture of the model's learning rate -- how quickly it improves with more data.
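The procedure described above — refitting at increasing training sizes, cross-validating at each size, and summarizing each size by a mean score and one standard deviation — can be sketched with scikit-learn's `learning_curve` utility. The synthetic dataset, model choice, and size grid below are illustrative assumptions, not part of the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data as a stand-in for a real dataset.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Refit and cross-validate the model at five increasing training set sizes,
# using 5-fold cross-validation at each size.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# Mean curve plus one standard deviation -- the shaded band in the plot.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

for n, tm, vm in zip(train_sizes, train_mean, test_mean):
    print(f"n={n:4d}  train={tm:.3f}  cv={vm:.3f}")
```

Each row of `train_scores` and `test_scores` holds the per-fold scores for one training size, so the mean and standard deviation across columns give exactly the curve and band described above.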
Learning curves provide actionable guidance. If the training and validation scores converge at a low value, the model has high bias and adding more data will not help; instead, one should increase model complexity (e.g. add features, use a more expressive model). If the training score is much higher than the validation score and the gap persists, the model has high variance; in this case, more training data may close the gap, or the model complexity should be reduced through regularization or feature reduction.
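The decision rule above can be condensed into a rough heuristic on the tail of the two curves. The threshold values (`gap_tol`, `low_score`) and the function name below are illustrative assumptions, not standard quantities.

```python
def diagnose(train_score, val_score, gap_tol=0.05, low_score=0.7):
    """Rough bias/variance diagnosis from the largest-n end of a learning curve.

    train_score / val_score: final mean scores, higher is better.
    gap_tol and low_score are illustrative thresholds, not standard values.
    """
    gap = train_score - val_score
    if gap > gap_tol:
        # Training score well above validation score: high variance.
        # More data, regularization, or feature reduction may help.
        return "high variance (overfitting)"
    if val_score < low_score:
        # Curves converged, but at a low value: high bias.
        # More data will not help; increase model complexity instead.
        return "high bias (underfitting)"
    return "good fit"

print(diagnose(0.95, 0.78))  # persistent gap -> high variance (overfitting)
print(diagnose(0.62, 0.60))  # converged but low -> high bias (underfitting)
print(diagnose(0.90, 0.88))  # converged and high -> good fit
```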
Usage
Learning curve analysis should be used when:
- You want to determine whether collecting more training data would improve your model.
- You suspect the model is underfitting or overfitting and want to confirm visually.
- You need to plan data collection efforts and want to estimate how much data is sufficient.
- You want to compare the data efficiency of different model architectures.
Theoretical Basis
Learning curves are grounded in statistical learning theory and the bias-variance decomposition. As the training set size grows, the expected behavior of the training and generalization errors follows predictable patterns.
The training error for a model of fixed complexity tends to increase with the training set size $n$, because it becomes harder for the model to perfectly fit more data points:

$$E_{\text{train}}(n) = \frac{1}{n} \sum_{i=1}^{n} L\big(\hat{f}_n(x_i),\, y_i\big)$$

where $\hat{f}_n$ is the model fit on $n$ training samples and $L$ is the loss. The generalization error (estimated by cross-validation) tends to decrease with $n$, because the model receives a more representative sample of the underlying distribution $\mathcal{D}$:

$$E_{\text{gen}}(n) = \mathbb{E}_{(x,\, y) \sim \mathcal{D}}\big[L(\hat{f}_n(x),\, y)\big]$$
Both curves converge to the irreducible error plus the approximation error of the model class. For a model with high bias, this asymptotic value is high. For a model with high variance, the convergence is slow, meaning the gap between training and test error remains large for moderate sample sizes.
In a $k$-fold cross-validation setting with training subset size $n$, the cross-validated estimate is

$$\widehat{E}_{\text{cv}}(n) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{|V_j|} \sum_{(x_i,\, y_i) \in V_j} L\big(\hat{f}_n^{(-j)}(x_i),\, y_i\big)$$

where $\hat{f}_n^{(-j)}$ is the model trained on $n$ samples from all folds except $j$, and $V_j$ is the validation fold.
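As a concrete check of the cross-validation estimate described above, the sketch below implements it in plain Python for a trivial model that predicts the mean of its training sample, under squared-error loss. The fold layout, data, and function name are illustrative assumptions.

```python
import statistics

def cv_estimate(values, k=4, n=6):
    """k-fold CV estimate of squared-error generalization loss for a
    model that predicts the mean of its training sample."""
    # Partition the data into k interleaved validation folds V_j.
    folds = [values[j::k] for j in range(k)]
    total = 0.0
    for j in range(k):
        # Training pool: all folds except j; take the first n samples.
        train = [v for i, f in enumerate(folds) if i != j for v in f][:n]
        model = statistics.mean(train)  # the fitted model for fold j
        val = folds[j]                  # held-out validation fold V_j
        # Mean squared error on the validation fold.
        total += sum((model - v) ** 2 for v in val) / len(val)
    return total / k

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(cv_estimate(data))  # average of the k per-fold validation errors
```

Averaging the per-fold validation losses, each computed with a model that never saw its own fold, is exactly the quantity plotted on the cross-validation curve at a single training size.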