Implementation:Scikit learn Scikit learn Train Test Split

From Leeroopedia


| Field | Value |
| --- | --- |
| source | scikit-learn (https://github.com/scikit-learn/scikit-learn) |
| domains | Data_Science, Machine_Learning |
| last_updated | 2026-02-08 15:00 GMT |

Overview

A concrete tool, provided by scikit-learn, for splitting arrays into random train and test subsets.

Description

The train_test_split function is a convenience wrapper that combines input validation, index generation, and array indexing in a single call. When shuffle is True, the indices come from ShuffleSplit (or StratifiedShuffleSplit when stratify is provided); when shuffle is False, the data is split in its original order. It accepts one or more indexable sequences (arrays, lists, DataFrames, sparse matrices) of the same length and returns their train-test splits as a flat list.
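The composition described above can be sketched by hand. The snippet below is a rough illustration under stated assumptions, not the library's actual code: it draws one pair of index arrays from ShuffleSplit and applies them to each input. The toy arrays and all variable names are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data (hypothetical): 10 samples, 2 features, with X[:, 0] == 2 * y.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Step 1: generate a single pair of train/test index arrays.
splitter = ShuffleSplit(n_splits=1, test_size=3, random_state=0)
train_idx, test_idx = next(splitter.split(X))

# Step 2: index every input with the same indices so rows stay aligned.
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Calling train_test_split directly is normally preferable, since it also validates the inputs and handles the stratified and unshuffled cases.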

Usage

  • Splitting feature matrices and target vectors before model training.
  • Creating reproducible holdout splits by setting random_state.
  • Preserving class distributions with stratified splits via the stratify parameter.
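As a small illustration of the reproducibility point in the list above, the following sketch calls the function twice with the same integer random_state and checks that the results agree. The toy arrays and the names first, second, and identical are assumptions made for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 samples, 2 features, binary labels.
X = np.arange(40).reshape(20, 2)
y = np.arange(20) % 2

# Two independent calls with the same integer seed.
first = train_test_split(X, y, test_size=0.25, random_state=0)
second = train_test_split(X, y, test_size=0.25, random_state=0)

# Compare each returned array pairwise.
identical = all(np.array_equal(a, b) for a, b in zip(first, second))
print(identical)  # True
```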

Code Reference

Source Location

sklearn/model_selection/_split.py, function train_test_split

Signature

def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
):

Import

from sklearn.model_selection import train_test_split

I/O Contract

Inputs

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| *arrays | sequence of indexables | (required) | One or more arrays, lists, sparse matrices, or pandas DataFrames with the same length (shape[0]). At least one array must be provided. |
| test_size | float, int, or None | None | If float, the proportion of the dataset to include in the test split (strictly between 0.0 and 1.0). If int, the absolute number of test samples. If None, defaults to the complement of train_size; if both are None, defaults to 0.25. |
| train_size | float, int, or None | None | If float, the proportion of the dataset to include in the train split. If int, the absolute number of train samples. If None, set to the complement of test_size. |
| random_state | int, RandomState instance, or None | None | Controls the shuffling applied before splitting. Pass an int for reproducible output across multiple calls. |
| shuffle | bool | True | Whether to shuffle the data before splitting. If False, stratify must be None. |
| stratify | array-like or None | None | If not None, data is split in a stratified fashion using this array as class labels. |
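The size rules in the table can be checked on a small list. This is a minimal sketch with made-up data; the variable names are illustrative, and the last part only demonstrates that combining shuffle=False with stratify raises an error.

```python
from sklearn.model_selection import train_test_split

data = list(range(10))  # hypothetical 10-sample dataset

# Both sizes None: test_size falls back to 0.25 (3 of 10 samples here).
train_a, test_a = train_test_split(data, random_state=0)
print(len(train_a), len(test_a))  # 7 3

# Integer test_size: absolute number of test samples.
train_b, test_b = train_test_split(data, test_size=2, random_state=0)
print(len(train_b), len(test_b))  # 8 2

# Only train_size given: test_size becomes its complement.
train_c, test_c = train_test_split(data, train_size=0.5, random_state=0)
print(len(train_c), len(test_c))  # 5 5

# shuffle=False together with stratify is rejected.
labels = [0, 1] * 5
rejected = False
try:
    train_test_split(data, shuffle=False, stratify=labels)
except ValueError:
    rejected = True
print(rejected)  # True
```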

Outputs

| Return | Type | Description |
| --- | --- | --- |
| splitting | list, length = 2 * len(arrays) | A list containing the train-test split of each input array. For each input array, the train portion appears first, followed by the test portion. E.g., passing X, y returns [X_train, X_test, y_train, y_test]. |
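To make the output ordering concrete, the sketch below passes three arrays and unpacks the six results. The weights array and all names are hypothetical; the data is built so that X[:, 0] == 2 * y, which makes it easy to check that each array was split on the same rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical), constructed so that X[:, 0] == 2 * y.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
weights = np.linspace(0.1, 1.0, 10)  # made-up per-sample weights

splits = train_test_split(X, y, weights, test_size=2, random_state=0)
print(len(splits))  # 6

# Order: train then test for each input, in the order the inputs were passed.
X_train, X_test, y_train, y_test, w_train, w_test = splits

# The same rows are selected from every input array.
aligned = np.array_equal(X_test[:, 0], 2 * y_test)
print(aligned)  # True
```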

Usage Examples

Basic split of feature matrix and target vector:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape)  # (105, 4)
print(X_test.shape)   # (45, 4)

Stratified split to preserve class proportions:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
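One way to confirm that the stratified split preserved class proportions is to count labels on each side. The check below is an illustrative addition; it relies on the iris dataset having 50 samples in each of its three classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Per-class sample counts in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts)  # [35 35 35]
print(test_counts)   # [15 15 15]
```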

Split without shuffling:

from sklearn.model_selection import train_test_split

data = list(range(10))
train, test = train_test_split(data, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
