
Principle:Scikit-learn Train-Test Splitting

From Leeroopedia


Field Value
sources Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed., Springer; scikit-learn documentation: https://scikit-learn.org/stable/modules/cross_validation.html
domains Data_Science, Machine_Learning, Statistics
last_updated 2026-02-08 15:00 GMT

Overview

A statistical sampling technique that partitions a dataset into non-overlapping training and testing subsets.

Description

Train-test splitting is the practice of dividing an available dataset into two disjoint subsets before any model fitting takes place:

  • The training set is used to fit model parameters (e.g., weights, coefficients). The model sees this data during the learning process.
  • The test set is held back and used exclusively to evaluate how well the trained model generalizes to unseen observations.

This separation is essential because evaluating a model on the same data it was trained on produces overly optimistic performance estimates. A model that memorizes its training data (overfitting) may score perfectly on training samples yet fail on new data. The holdout test set provides an unbiased estimate of generalization error.
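The optimism gap can be observed directly. A minimal sketch, assuming the bundled breast-cancer dataset and an unpruned decision tree purely for illustration (neither is prescribed by this page):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree memorizes the training data: perfect training accuracy,
# but a noticeably lower (and more honest) score on the held-out test set.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy: ", tree.score(X_test, y_test))
```

The test-set score, not the training score, is the estimate of generalization error.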

Stratification is an important refinement of random splitting. When performing stratified splitting, the class distribution of the target variable is preserved in both the training and testing subsets. This is particularly valuable when dealing with imbalanced datasets where one class may be significantly underrepresented. Without stratification, a random split could, by chance, place all samples of a minority class into one subset.
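Stratification can be requested through the `stratify` parameter of `train_test_split`. A sketch on a synthetic 90/10 imbalanced dataset (the data and sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Passing stratify=y preserves the 9:1 class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The minority class is split 8/2, matching the 80/20 overall split
print((y_train == 1).sum(), (y_test == 1).sum())
```

Without `stratify=y`, an unlucky seed could place zero or all minority samples in the test set.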

Usage

Use train-test splitting when:

  • Simple holdout evaluation -- A single split is sufficient for quick model assessment, especially with large datasets where statistical variability from the split is low.
  • Before any model training -- The split must happen before any data-dependent decisions (feature selection, hyperparameter tuning) to prevent data leakage.
  • Stratification is needed -- Use the stratify parameter to maintain class proportions across subsets, particularly for imbalanced classification tasks.

Consider cross-validation (e.g., k-fold) instead of a single train-test split when the dataset is small and a single split may produce high-variance performance estimates.
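As a sketch of the k-fold alternative, `cross_val_score` fits and evaluates the model on k different train/test partitions; the iris dataset and logistic-regression model below are illustrative choices, not prescribed by this page:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# cv=5 yields five accuracy scores, one per held-out fold;
# their mean and spread summarize performance and its variance.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```

The spread of the five scores gives a sense of the variance that a single holdout split would hide.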

Theoretical Basis

Random Sampling

In simple random splitting, each sample has an equal probability of being assigned to either subset. Given a dataset of n samples and a desired test fraction p, the procedure selects approximately n·p samples uniformly at random for the test set and assigns the remaining samples to the training set. Shuffling is controlled by a random seed (random_state) to ensure reproducibility.
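Seed-controlled reproducibility can be checked directly; a short sketch on a toy list:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# The same random_state produces the same shuffle, hence the same partition
a_train, a_test = train_test_split(data, test_size=0.3, random_state=42)
b_train, b_test = train_test_split(data, test_size=0.3, random_state=42)

assert a_test == b_test and a_train == b_train
```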

Stratified Sampling

Stratified splitting extends random sampling by performing the random partitioning independently within each class. If class $k$ has $n_k$ samples, approximately $n_k \cdot p$ of them are placed into the test set. This guarantees that the class ratio $n_k/n$ is approximately preserved in both subsets.

Formally, for each class $k \in \{1, \dots, K\}$:

$$\frac{|S_k^{\text{train}}|}{|S^{\text{train}}|} \approx \frac{|S_k^{\text{test}}|}{|S^{\text{test}}|} \approx \frac{n_k}{n}$$

where $S_k^{\text{train}}$ and $S_k^{\text{test}}$ denote the samples of class $k$ in the training and test subsets, respectively.
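The per-class procedure can be sketched in pure Python; this is an illustrative re-implementation of the idea, not scikit-learn's actual code:

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac, seed=0):
    """Partition indices so each class contributes ~test_frac of its samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, k in enumerate(labels):
        by_class[k].append(i)
    train, test = [], []
    # Split each class independently, so n_k * test_frac of class k lands in test
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test

# 80 samples of class "a", 20 of "b"; a 25% stratified test set
labels = ["a"] * 80 + ["b"] * 20
train, test = stratified_split(labels, 0.25)
print(len(test), sum(1 for i in test if labels[i] == "b"))  # 25 test samples, 5 of class "b"
```

Each class's 4:1 ratio is preserved exactly here because the class sizes divide evenly; in general the ratios hold only approximately, as in the formula above.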

Default Behavior

When neither test_size nor train_size is specified, scikit-learn defaults to a 75/25 train/test split (i.e., test_size=0.25).
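This default can be verified directly; a sketch with a 100-element toy list:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))

# No test_size or train_size given: scikit-learn uses test_size=0.25
X_train, X_test = train_test_split(X, random_state=0)
print(len(X_train), len(X_test))  # 75 25
```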
