
Principle:DistrictDataLabs Yellowbrick Learning Curve Analysis

From Leeroopedia


Knowledge Sources
Domains Machine_Learning, Model_Selection, Hyperparameter_Tuning
Last Updated 2026-02-08 00:00 GMT

Overview

Learning curve analysis is a diagnostic technique that evaluates how a model's training and cross-validation performance change as the number of training samples increases, revealing whether the model would benefit from more data or from increased complexity.

Description

A learning curve plots model performance (on the y-axis) against the number of training examples used to fit the model (on the x-axis). Two curves are generated: one for the training score and one for the cross-validated test score. By observing how these curves converge or diverge as more training data is added, practitioners can diagnose whether a model is suffering from high bias (underfitting) or high variance (overfitting).

The training process is repeated for a sequence of increasing training set sizes. At each size, the model is trained on the given subset and evaluated via cross-validation. The mean score across folds is plotted along with a shaded band representing one standard deviation, which indicates the variability of the estimate. This gives an empirical picture of the model's learning rate -- how quickly it improves with more data.
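The procedure above can be sketched with scikit-learn's `learning_curve` utility, which is the machinery Yellowbrick wraps. The synthetic dataset, the `GaussianNB` estimator, and the chosen sizes are illustrative assumptions, not part of the page's prescription:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic data stands in for a real dataset (illustrative assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Refit the model at five increasing training-set sizes, 5-fold CV at each.
sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# Mean score across folds, plus one standard deviation for the shaded band.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
val_mean, val_std = val_scores.mean(axis=1), val_scores.std(axis=1)
```

Plotting `train_mean` and `val_mean` against `sizes`, with `±1` standard-deviation bands, reproduces the learning curve described above.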

Learning curves provide actionable guidance. If the training and validation scores converge at a low value, the model has high bias and adding more data will not help; instead, one should increase model complexity (e.g. add features, use a more expressive model). If the training score is much higher than the validation score and the gap persists, the model has high variance; in this case, more training data may close the gap, or the model complexity should be reduced through regularization or feature reduction.
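The decision rule in the paragraph above can be captured as a small heuristic. The thresholds (`gap_tol`, `low_score`) are illustrative assumptions; in practice they depend on the metric and the problem:

```python
def diagnose(train_score, val_score, *, gap_tol=0.05, low_score=0.7):
    """Heuristic read of the final point of a learning curve.

    Thresholds are illustrative assumptions, not fixed rules.
    """
    gap = train_score - val_score
    if gap <= gap_tol and val_score < low_score:
        # Curves converged at a low value: more data will not help.
        return "high bias: add complexity, not data"
    if gap > gap_tol:
        # Persistent train/validation gap.
        return "high variance: more data or regularization may help"
    return "reasonable fit"
```

For example, `diagnose(0.65, 0.63)` flags high bias, while `diagnose(0.99, 0.80)` flags high variance.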

Usage

Learning curve analysis should be used when:

  • You want to determine whether collecting more training data would improve your model.
  • You suspect the model is underfitting or overfitting and want to confirm visually.
  • You need to plan data collection efforts and want to estimate how much data is sufficient.
  • You want to compare the data efficiency of different model architectures.

Theoretical Basis

Learning curves are grounded in statistical learning theory and the bias-variance decomposition. As the training set size n grows, the expected behavior of the training and generalization errors follows predictable patterns.

The training error for a model of fixed complexity tends to increase with n because it becomes harder for the model to perfectly fit more data points:

E_train(n) ↑  as  n ↑

The generalization error (estimated by cross-validation) tends to decrease with n because the model receives a more representative sample of the underlying distribution:

E_test(n) ↓  as  n ↑

Both curves converge to the irreducible error plus the approximation error of the model class. For a model with high bias, this asymptotic value is high. For a model with high variance, the convergence is slow, meaning the gap between training and test error remains large for moderate sample sizes.
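The shared asymptote described above can be written in standard bias-variance notation, where σ² denotes the irreducible noise and ℱ the model class (symbols introduced here for illustration):

```latex
\lim_{n \to \infty} E_{\text{train}}(n)
  \;=\; \lim_{n \to \infty} E_{\text{test}}(n)
  \;=\; \underbrace{\sigma^2}_{\text{irreducible error}}
  \;+\; \underbrace{\operatorname{bias}^2(\mathcal{F})}_{\text{approximation error}}
```

A high-bias model has a large second term, so both curves plateau at a high error; a high-variance model has a small second term but approaches the limit slowly.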

In a k-fold cross-validation setting with training subset size m:

CV_k(m) = (1/k) ∑_{i=1}^{k} S(f̂_m^(i), D_i)

where f̂_m^(i) is the model trained on m samples drawn from all folds except fold i, S is the scoring function, and D_i is the i-th validation fold.
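The formula above translates directly into code. `model_fit` and `score` are assumed callables standing in for S and the training procedure (an illustrative interface, not a library API):

```python
import numpy as np

def cv_score(model_fit, score, X, y, m, k=5, seed=0):
    """Compute CV_k(m): the mean validation score of k models, each
    trained on m samples drawn from the k-1 non-held-out folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        # Pool every fold except fold i, then take m training samples.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        subset = train_idx[:m]
        f_hat = model_fit(X[subset], y[subset])      # f̂_m^(i)
        scores.append(score(f_hat, X[folds[i]], y[folds[i]]))  # S(f̂, D_i)
    return float(np.mean(scores))
```

Sweeping `m` over a grid of sizes and recording `cv_score` at each yields the validation side of the learning curve.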

Related Pages

Implemented By
