Principle:DistrictDataLabs Yellowbrick Cross Validation Scoring
| Knowledge Sources | |
|---|---|
| Domains | Machine_Learning, Model_Selection, Hyperparameter_Tuning |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Cross-validation scoring is a model evaluation technique that partitions data into multiple train/test splits and aggregates performance scores across all splits, providing a robust estimate of model generalization performance with a measure of variability.
Description
In machine learning, evaluating a model on a single train/test split can produce misleading results because the score depends heavily on how the data was partitioned. Cross-validation addresses this by systematically splitting the dataset into k disjoint folds, training the model on k-1 folds, and evaluating on the held-out fold. This process is repeated k times so that every fold serves as the test set exactly once. The resulting k scores provide a distribution of performance estimates rather than a single point estimate.
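The splitting-and-scoring loop described above can be sketched in plain Python. This is a minimal, dependency-free illustration, not the Yellowbrick or scikit-learn implementation; `train_fn` and `score_fn` are hypothetical callables standing in for any model-fitting and scoring routines.

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices 0..n_samples-1 into k contiguous folds
    whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_scores(train_fn, score_fn, X, y, k=5):
    """Hold out each fold in turn: train on the other k-1 folds,
    score on the held-out fold, and collect all k scores."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score_fn(model, [X[j] for j in test_idx],
                               [y[j] for j in test_idx]))
    return scores
```

For example, a trivial "predict the training mean" model scored by negative mean absolute error can be passed as `train_fn` and `score_fn` to obtain one score per fold. In practice, library routines such as scikit-learn's `cross_val_score` also shuffle the data before splitting; the contiguous folds here are kept deliberately simple.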
The most common form is k-fold cross-validation, where the dataset is divided into k equally sized partitions. Other strategies include stratified k-fold (which preserves class proportions in each fold), leave-one-out (where k equals the number of samples), and group-aware splits (which ensure samples from the same group do not appear in both training and test sets). The choice of k involves a tradeoff: larger k values produce lower-bias estimates but higher variance and greater computational cost.
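The stratified variant can be illustrated with a simple round-robin deal: group sample indices by class label, then distribute each class's indices across the k folds in turn, so every fold inherits roughly the same class proportions as the full dataset. This is a minimal sketch of the idea, not the exact algorithm library implementations use.

```python
from collections import defaultdict

def stratified_k_fold_indices(y, k):
    """Assign sample indices to k folds while approximately preserving
    class proportions: deal each class's indices round-robin over the folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds
```

With a 2:1 class imbalance, every fold produced this way keeps the same 2:1 ratio, which matters for small or imbalanced datasets where a plain k-fold split can leave a fold with few or no samples of a minority class.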
Visualizing the individual fold scores alongside their mean provides insight beyond a single aggregate number. A bar chart of fold scores reveals whether performance is consistent across splits or whether certain data partitions are substantially harder for the model. High variability across folds may indicate that the model is sensitive to the particular training samples, that the data contains heterogeneous subpopulations, or that the dataset is too small for reliable estimation.
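Yellowbrick's `CVScores` visualizer draws this kind of chart graphically. As a dependency-free stand-in, the same bar-chart-with-mean idea can be rendered as text; the scale factor `width` and the example scores below are illustrative choices, and the scores are assumed to lie in [0, 1].

```python
def fold_score_bars(scores, width=40):
    """Render per-fold scores (assumed in [0, 1]) as horizontal text bars,
    with the mean drawn as a final reference row."""
    mean = sum(scores) / len(scores)
    rows = [f"fold {i}: {'#' * round(s * width)} {s:.3f}"
            for i, s in enumerate(scores)]
    rows.append(f"mean  : {'=' * round(mean * width)} {mean:.3f}")
    return "\n".join(rows)

print(fold_score_bars([0.82, 0.79, 0.91, 0.68, 0.85]))
```

Scanning the bars makes it immediately visible whether one fold (here, the 0.68) drags the mean down, which a single aggregate number would hide.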
Usage
Cross-validation scoring should be used when:
- You want a reliable estimate of model generalization performance that accounts for data variability.
- You are comparing multiple models and need a fair assessment methodology.
- You want to detect whether your model performance is stable or highly variable across different data splits.
- You need to report model performance with confidence intervals or variability measures.
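For the reporting and model-comparison cases above, a common convention is to present the mean fold score together with its standard deviation. A small sketch, using hypothetical fold scores for two candidate models evaluated on the same folds:

```python
import statistics

def summarize_cv(scores):
    """Report cross-validated performance as 'mean ± std' over the k folds."""
    return f"{statistics.mean(scores):.3f} ± {statistics.pstdev(scores):.3f}"

# Hypothetical fold scores for two models evaluated on the same splits.
model_a = [0.82, 0.79, 0.91, 0.68, 0.85]  # higher mean, large spread
model_b = [0.80, 0.81, 0.79, 0.82, 0.80]  # slightly lower mean, very stable
```

Reporting both numbers makes the comparison fair: model A's higher mean comes with much higher fold-to-fold variability, so its apparent edge over the stable model B may not be reliable.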
Theoretical Basis
In k-fold cross-validation, the dataset $D$ is partitioned into $k$ disjoint subsets $D_1, \dots, D_k$ of approximately equal size. For each fold $i = 1, \dots, k$:

$$s_i = S(\hat{f}_{-i},\, D_i)$$

where $S$ is the scoring function and $\hat{f}_{-i}$ is the model trained on $D \setminus D_i$. The cross-validated score and its standard deviation are:

$$\bar{s} = \frac{1}{k} \sum_{i=1}^{k} s_i, \qquad \sigma = \sqrt{\frac{1}{k} \sum_{i=1}^{k} \left(s_i - \bar{s}\right)^2}$$
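As a concrete check of these formulas, take $k = 5$ with hypothetical fold scores:

$$s = (0.82,\ 0.79,\ 0.91,\ 0.68,\ 0.85) \;\Longrightarrow\; \bar{s} = \frac{4.05}{5} = 0.81, \qquad \sigma = \sqrt{\tfrac{1}{5}\left(0.01^2 + 0.02^2 + 0.10^2 + 0.13^2 + 0.04^2\right)} \approx 0.076$$

so the result would be reported as $0.81 \pm 0.08$.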
The cross-validated score $\bar{s}$ is an approximately unbiased estimate of the model's expected performance on unseen data drawn from the same distribution. However, because the training sets overlap (each pair of training sets shares $\frac{k-2}{k-1}$ of its data), the individual fold scores are correlated, which means the standard deviation $\sigma$ may underestimate the true variability of the performance estimate.
The expected generalization error can be decomposed as:

$$\mathbb{E}\!\left[\left(y - \hat{f}(x)\right)^2\right] = \mathrm{Bias}\!\left[\hat{f}(x)\right]^2 + \mathrm{Var}\!\left[\hat{f}(x)\right] + \sigma_\epsilon^2$$

where $\sigma_\epsilon^2$ is the irreducible noise.
Cross-validation helps estimate this total error empirically. With $k = n$ (leave-one-out), the estimate has low bias but high variance; with small $k$ (e.g. $k = 5$ or $k = 10$), the estimate trades a small amount of bias for substantially lower variance.