Principle: DistrictDataLabs Yellowbrick Learning Curve Analysis
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Model_Selection, Hyperparameter_Tuning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Learning curve analysis is a diagnostic technique that evaluates how a model's training and cross-validation performance change as the number of training samples increases, revealing whether the model would benefit from more data or from increased complexity.
Description
A learning curve plots model performance (on the y-axis) against the number of training examples used to fit the model (on the x-axis). Two curves are generated: one for the training score and one for the cross-validated test score. By observing how these curves converge or diverge as more training data is added, practitioners can diagnose whether a model is suffering from high bias (underfitting) or high variance (overfitting).
The training process is repeated for a sequence of increasing training set sizes. At each size, the model is trained on the given subset and evaluated via cross-validation. The mean score across folds is plotted along with a shaded band representing one standard deviation, which indicates the variability of the estimate. This gives an empirical picture of the model's learning rate -- how quickly it improves with more data.
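The procedure described above — refitting at increasing training sizes, cross-validating at each size, and summarizing each size by a mean score and one standard deviation — can be sketched with scikit-learn's `learning_curve` utility. The synthetic dataset, model choice, and size grid below are illustrative assumptions, not part of the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic classification data as a stand-in for a real dataset.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Refit and cross-validate the model at five increasing training set sizes,
# using 5-fold cross-validation at each size.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# Mean curve plus one standard deviation -- the shaded band in the plot.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
test_mean, test_std = test_scores.mean(axis=1), test_scores.std(axis=1)

for n, tm, vm in zip(train_sizes, train_mean, test_mean):
    print(f"n={n:4d}  train={tm:.3f}  cv={vm:.3f}")
```

Each row of `train_scores` and `test_scores` holds the per-fold scores for one training size, so the mean and standard deviation across columns give exactly the curve and band described above.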
Learning curves provide actionable guidance. If the training and validation scores converge at a low value, the model has high bias and adding more data will not help; instead, one should increase model complexity (e.g. add features, use a more expressive model). If the training score is much higher than the validation score and the gap persists, the model has high variance; in this case, more training data may close the gap, or the model complexity should be reduced through regularization or feature reduction.
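The decision rule above can be condensed into a rough heuristic on the tail of the two curves. The threshold values (`gap_tol`, `low_score`) and the function name below are illustrative assumptions, not standard quantities.

```python
def diagnose(train_score, val_score, gap_tol=0.05, low_score=0.7):
    """Rough bias/variance diagnosis from the largest-n end of a learning curve.

    train_score / val_score: final mean scores, higher is better.
    gap_tol and low_score are illustrative thresholds, not standard values.
    """
    gap = train_score - val_score
    if gap > gap_tol:
        # Training score well above validation score: high variance.
        # More data, regularization, or feature reduction may help.
        return "high variance (overfitting)"
    if val_score < low_score:
        # Curves converged, but at a low value: high bias.
        # More data will not help; increase model complexity instead.
        return "high bias (underfitting)"
    return "good fit"

print(diagnose(0.95, 0.78))  # persistent gap -> high variance (overfitting)
print(diagnose(0.62, 0.60))  # converged but low -> high bias (underfitting)
print(diagnose(0.90, 0.88))  # converged and high -> good fit
```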
Usage
Learning curve analysis should be used when:
- You want to determine whether collecting more training data would improve your model.
- You suspect the model is underfitting or overfitting and want to confirm visually.
- You need to plan data collection efforts and want to estimate how much data is sufficient.
- You want to compare the data efficiency of different model architectures.
Theoretical Basis
Learning curves are grounded in statistical learning theory and the bias-variance decomposition. As the training set size grows, the expected behavior of the training and generalization errors follows predictable patterns.
The training error for a model of fixed complexity tends to increase with the training set size $n$, because it becomes harder for the model to perfectly fit more data points:

$$E_{\text{train}}(n) = \frac{1}{n} \sum_{i=1}^{n} L\big(\hat{f}_n(x_i),\, y_i\big)$$

where $\hat{f}_n$ is the model fit on $n$ training samples and $L$ is the loss. The generalization error (estimated by cross-validation) tends to decrease with $n$, because the model receives a more representative sample of the underlying distribution $\mathcal{D}$:

$$E_{\text{gen}}(n) = \mathbb{E}_{(x,\, y) \sim \mathcal{D}}\big[L(\hat{f}_n(x),\, y)\big]$$
Both curves converge to the irreducible error plus the approximation error of the model class. For a model with high bias, this asymptotic value is high. For a model with high variance, the convergence is slow, meaning the gap between training and test error remains large for moderate sample sizes.
In a $k$-fold cross-validation setting with training subset size $n$, the cross-validated estimate is

$$\widehat{E}_{\text{cv}}(n) = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{|V_j|} \sum_{(x_i,\, y_i) \in V_j} L\big(\hat{f}_n^{(-j)}(x_i),\, y_i\big)$$

where $\hat{f}_n^{(-j)}$ is the model trained on $n$ samples from all folds except $j$, and $V_j$ is the validation fold.
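As a concrete check of the cross-validation estimate described above, the sketch below implements it in plain Python for a trivial model that predicts the mean of its training sample, under squared-error loss. The fold layout, data, and function name are illustrative assumptions.

```python
import statistics

def cv_estimate(values, k=4, n=6):
    """k-fold CV estimate of squared-error generalization loss for a
    model that predicts the mean of its training sample."""
    # Partition the data into k interleaved validation folds V_j.
    folds = [values[j::k] for j in range(k)]
    total = 0.0
    for j in range(k):
        # Training pool: all folds except j; take the first n samples.
        train = [v for i, f in enumerate(folds) if i != j for v in f][:n]
        model = statistics.mean(train)  # the fitted model for fold j
        val = folds[j]                  # held-out validation fold V_j
        # Mean squared error on the validation fold.
        total += sum((model - v) ** 2 for v in val) / len(val)
    return total / k

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(cv_estimate(data))  # average of the k per-fold validation errors
```

Averaging the per-fold validation losses, each computed with a model that never saw its own fold, is exactly the quantity plotted on the cross-validation curve at a single training size.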