
Principle:Scikit-learn Train-Test Splitting

From Leeroopedia


Field Value
sources Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed., Springer; scikit-learn documentation: https://scikit-learn.org/stable/modules/cross_validation.html
domains Data_Science, Machine_Learning, Statistics
last_updated 2026-02-08 15:00 GMT

Overview

A statistical sampling technique that partitions a dataset into non-overlapping training and testing subsets.

Description

Train-test splitting is the practice of dividing an available dataset into two disjoint subsets before any model fitting takes place:

  • The training set is used to fit model parameters (e.g., weights, coefficients). The model sees this data during the learning process.
  • The test set is held back and used exclusively to evaluate how well the trained model generalizes to unseen observations.

This separation is essential because evaluating a model on the same data it was trained on produces overly optimistic performance estimates. A model that memorizes its training data (overfitting) may score perfectly on training samples yet fail on new data. The holdout test set provides an unbiased estimate of generalization error.
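The optimism gap can be observed directly. A minimal sketch, assuming the bundled breast-cancer dataset and an unpruned decision tree purely for illustration (neither is prescribed by this page):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree memorizes the training data: perfect training accuracy,
# but a noticeably lower (and more honest) score on the held-out test set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```

The test-set score, not the training score, is the estimate of generalization error.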

Stratification is an important refinement of random splitting. When performing stratified splitting, the class distribution of the target variable is preserved in both the training and testing subsets. This is particularly valuable when dealing with imbalanced datasets where one class may be significantly underrepresented. Without stratification, a random split could, by chance, place all samples of a minority class into one subset.
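Stratification can be requested through the `stratify` parameter of `train_test_split`. A sketch on a synthetic 90/10 imbalanced dataset (the data and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Passing stratify=y preserves the 9:1 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The minority class is split 8/2, matching the 80/20 overall split
print((y_train == 1).sum(), (y_test == 1).sum())
```

Without `stratify=y`, an unlucky seed could place zero or all minority samples in the test set.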

Usage

Use train-test splitting when:

  • Simple holdout evaluation -- A single split is sufficient for quick model assessment, especially with large datasets where statistical variability from the split is low.
  • Before any model training -- The split must happen before any data-dependent decisions (feature selection, hyperparameter tuning) to prevent data leakage.
  • Stratification is needed -- Use the stratify parameter to maintain class proportions across subsets, particularly for imbalanced classification tasks.

Consider cross-validation (e.g., k-fold) instead of a single train-test split when the dataset is small and a single split may produce high-variance performance estimates.
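As a sketch of the k-fold alternative, `cross_val_score` fits and evaluates the model on k different train/test partitions; the iris dataset and logistic-regression model below are illustrative choices, not prescribed by this page:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 yields five accuracy scores, one per held-out fold;
# their mean and spread summarize performance and its variance.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the five scores gives a sense of the variance that a single holdout split would hide.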

Theoretical Basis

Random Sampling

In simple random splitting, each sample has an equal probability of being assigned to either subset. Given a dataset of n samples and a desired test fraction p, the procedure selects approximately n·p samples uniformly at random for the test set and assigns the remaining samples to the training set. Shuffling is controlled by a random seed (random_state) to ensure reproducibility.
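Seed-controlled reproducibility can be checked directly; a short sketch on a toy list:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# The same random_state produces the same shuffle, hence the same partition
a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)

assert a_test == b_test and a_train == b_train
```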

Stratified Sampling

Stratified splitting extends random sampling by performing the random partitioning independently within each class. If class $k$ has $n_k$ samples, approximately $n_k \cdot p$ of them are placed into the test set. This guarantees that the class ratio $n_k/n$ is approximately preserved in both subsets.

Formally, for each class $k \in \{1, \dots, K\}$:

$$\frac{|S_k^{\text{train}}|}{|S^{\text{train}}|} \approx \frac{|S_k^{\text{test}}|}{|S^{\text{test}}|} \approx \frac{n_k}{n}$$

where $S_k^{\text{train}}$ and $S_k^{\text{test}}$ denote the samples of class $k$ in the training and test subsets, respectively.
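The per-class procedure can be sketched in pure Python; this is an illustrative re-implementation of the idea, not scikit-learn's actual code:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac, seed=0):
    """Partition indices so each class contributes ~test_frac of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, k in enumerate(labels):
        by_class[k].append(i)
    train, test = [], []
    # Split each class independently, so n_k * test_frac of class k lands in test
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

# 80 samples of class "a", 20 of "b"; a 25% stratified test set
labels = ["a"] * 80 + ["b"] * 20
train, test = stratified_split(labels, 0.25)
print(len(test), sum(1 for i in test if labels[i] == "b"))  # 25 test samples, 5 of class "b"
```

Each class's 4:1 ratio is preserved exactly here because the class sizes divide evenly; in general the ratios hold only approximately, as in the formula above.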

Default Behavior

When neither test_size nor train_size is specified, scikit-learn defaults to a 75/25 train/test split (i.e., test_size=0.25).
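This default can be verified directly; a sketch with a 100-element toy list:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))

# No test_size or train_size given: scikit-learn uses test_size=0.25
X_train, X_test = train_test_split(X, random_state=0)
print(len(X_train), len(X_test))  # 75 25
```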
