Implementation:Scikit learn Scikit learn Train Test Split

Field	Value
source	scikit-learn (https://github.com/scikit-learn/scikit-learn)
domains	Data_Science, Machine_Learning
last_updated	2026-02-08 15:00 GMT

Overview

train_test_split is the scikit-learn utility for splitting arrays or matrices into random train and test subsets.

Description

The train_test_split function is a convenience wrapper that combines input validation, index generation, and array indexing into a single call. Indices come from ShuffleSplit, or from StratifiedShuffleSplit when stratify is provided; with shuffle=False the leading samples form the train set and the samples that follow form the test set, in their original order. The function accepts one or more indexable sequences of the same length (lists, arrays, DataFrames, sparse matrices) and returns their train and test portions as a flat list.
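The wrapper's behavior can be approximated by hand. The following is a minimal sketch, assuming the default shuffle=True and no stratify: it draws one split of row indices from ShuffleSplit and indexes the array directly. Variable names such as X_train_manual are illustrative only, and the sketch is not the library's actual code path, which also validates inputs and handles non-NumPy containers.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import ShuffleSplit, train_test_split

X, y = load_iris(return_X_y=True)

# Step 1: generate one shuffled split of row indices.
splitter = ShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(X))

# Step 2: index the array manually with those row indices.
X_train_manual, X_test_manual = X[train_idx], X[test_idx]

# The convenience wrapper, called with the same size and seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# With identical settings, both routes are expected to select the same rows.
same_split = np.array_equal(X_train, X_train_manual) and np.array_equal(
    X_test, X_test_manual
)
print(same_split)
```

In practice the single call is preferable; the manual route is mainly useful when the row indices themselves are needed.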

Usage

Splitting feature matrices and target vectors before model training.
Creating reproducible holdout splits by setting random_state.
Preserving class distributions with stratified splits via the stratify parameter.
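As a small illustration of the reproducibility use case, the sketch below calls the function twice with the same integer random_state and checks that the splits match. The toy arrays are made up for the example; they are built so that each row of X can be traced back to its entry in y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: row i of X is [2*i, 2*i + 1], and y[i] == i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

first = train_test_split(X, y, random_state=0)
second = train_test_split(X, y, random_state=0)

# The same integer seed yields the same split on every call.
identical = all(np.array_equal(a, b) for a, b in zip(first, second))
print(identical)  # True

# Rows of X and entries of y stay aligned after the split.
X_train, X_test, y_train, y_test = first
aligned = np.array_equal(X_train[:, 0], 2 * y_train)
print(aligned)  # True
```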

Code Reference

Source Location

sklearn/model_selection/_split.py, function train_test_split

Signature

def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
):

Import

from sklearn.model_selection import train_test_split

I/O Contract

Inputs

Parameter	Type	Default	Description
`*arrays`	sequence of indexables	(required)	One or more lists, NumPy arrays, SciPy sparse matrices, or pandas DataFrames, all with the same length (`shape[0]`). At least one array must be provided.
`test_size`	float or int or None	`None`	If float, the proportion of the dataset to include in the test split; must be strictly between 0.0 and 1.0. If int, the absolute number of test samples. If None, set to the complement of `train_size`; if `train_size` is also None, defaults to `0.25`.
`train_size`	float or int or None	`None`	If float, the proportion of the dataset to include in the train split; must be strictly between 0.0 and 1.0. If int, the absolute number of train samples. If None, set to the complement of `test_size`.
`random_state`	int, RandomState, or None	`None`	Controls the shuffling applied before splitting. Pass an int for reproducible output across multiple calls.
`shuffle`	bool	`True`	Whether to shuffle the data before splitting. If `False`, `stratify` must be `None`.
`stratify`	array-like or None	`None`	If not None, data is split in a stratified fashion using this array as class labels.
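The interaction between the size parameters is easiest to see on a tiny input. The sketch below, using a made-up list of ten items and hypothetical variable names, shows the default size, an integer test_size, both sizes given together, and the error raised when stratify is combined with shuffle=False.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# Both sizes omitted: test_size falls back to 0.25, rounded up to 3 samples.
default_train, default_test = train_test_split(data, random_state=0)
print(len(default_train), len(default_test))  # 7 3

# Integer test_size: an absolute number of test samples.
int_train, int_test = train_test_split(data, test_size=2, random_state=0)
print(len(int_train), len(int_test))  # 8 2

# Both sizes given: together they need not cover the whole dataset.
part_train, part_test = train_test_split(
    data, train_size=0.5, test_size=0.2, random_state=0
)
print(len(part_train), len(part_test))  # 5 2

# stratify requires shuffling, so shuffle=False with stratify is rejected.
labels = [0, 1] * 5
raised = False
try:
    train_test_split(data, shuffle=False, stratify=labels)
except ValueError:
    raised = True
print(raised)  # True
```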

Outputs

Return	Type	Description
splitting	`list`, length = `2 * len(arrays)`	A list containing the train-test split of each input array. For each input array, the train portion appears first followed by the test portion. E.g., passing `X, y` returns `[X_train, X_test, y_train, y_test]`.
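To make the output ordering concrete, here is a short sketch that passes three arrays and unpacks the six results. The weights array is a hypothetical set of per-sample weights invented for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(16).reshape(8, 2)
y = np.arange(8)                    # y[i] == i, so y doubles as a row id
weights = np.linspace(0.1, 0.8, 8)  # hypothetical per-sample weights

parts = train_test_split(X, y, weights, test_size=0.25, random_state=0)
print(len(parts))  # 6, i.e. 2 * number of input arrays

# Train and test portions alternate, in the order the arrays were passed.
X_train, X_test, y_train, y_test, w_train, w_test = parts
print(X_train.shape, X_test.shape)  # (6, 2) (2, 2)

# All arrays are indexed with the same rows, so they stay aligned.
rows_match = np.allclose(w_test, weights[y_test])
print(rows_match)  # True
```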

Usage Examples

Basic split of feature matrix and target vector:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape)  # (105, 4)
print(X_test.shape)   # (45, 4)

Stratified split to preserve class proportions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
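One way to confirm that stratification preserved the class balance is to count the labels on each side of the split. A minimal check, relying on the fact that the iris dataset has 50 samples in each of its three classes:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count samples per class on each side of the split.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts)  # [35 35 35]
print(test_counts)   # [15 15 15]
```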

Split without shuffling:

from sklearn.model_selection import train_test_split

data = list(range(10))
train, test = train_test_split(data, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
