Implementation:Scikit learn Scikit learn Train Test Split
| Field | Value |
|---|---|
| source | [scikit-learn](https://github.com/scikit-learn/scikit-learn) |
| domains | Data_Science, Machine_Learning |
| last_updated | 2026-02-08 15:00 GMT |
Overview
A concrete scikit-learn tool for splitting arrays into random train and test subsets.
Description
The train_test_split function is a convenience wrapper that combines input validation, index generation via ShuffleSplit (or StratifiedShuffleSplit when stratify is provided), and array indexing into a single function call. It accepts one or more indexable sequences (arrays, DataFrames, sparse matrices) of the same length and returns their train-test splits as a flat list.
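To make the wrapper relationship concrete, the sketch below performs the same split twice: once by hand with `ShuffleSplit` plus NumPy indexing, and once with `train_test_split`. It is an illustration under stated assumptions, not the library's internal code. The toy arrays, the seed, and all variable names are made up for this example, and the expectation that both routes select the same rows assumes both draw the same permutation from the same seed.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit, train_test_split

# Toy data (hypothetical): 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Manual route: generate one pair of index arrays, then index each input.
splitter = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X))
X_train_manual, X_test_manual = X[train_idx], X[test_idx]
y_train_manual, y_test_manual = y[train_idx], y[test_idx]

# Convenience route: one call with the same test_size and seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# With identical settings the two routes are expected to pick the same rows.
same_split = np.array_equal(X_train, X_train_manual) and np.array_equal(
    y_test, y_test_manual
)
print(same_split)
```

The manual route is mainly useful when the index arrays themselves are needed, for example to record which rows ended up in the test set.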
Usage
- Splitting feature matrices and target vectors before model training.
- Creating reproducible holdout splits by setting `random_state`.
- Preserving class distributions with stratified splits via the `stratify` parameter.
Code Reference
Source Location
sklearn/model_selection/_split.py, function train_test_split
Signature
```python
def train_test_split(
    *arrays,
    test_size=None,
    train_size=None,
    random_state=None,
    shuffle=True,
    stratify=None,
):
```
Import
```python
from sklearn.model_selection import train_test_split
```
I/O Contract
Inputs
| Parameter | Type | Default | Description |
|---|---|---|---|
| `*arrays` | sequence of indexables | (required) | One or more arrays, lists, sparse matrices, or pandas DataFrames with the same `shape[0]`. At least one array must be provided. |
| `test_size` | float, int, or None | `None` | If float, the proportion of the dataset for the test split (strictly between 0.0 and 1.0). If int, the absolute number of test samples. If None, defaults to the complement of `train_size`; if both are None, defaults to 0.25. |
| `train_size` | float, int, or None | `None` | If float, the proportion for the train split. If int, the absolute number of train samples. If None, set to the complement of `test_size`. |
| `random_state` | int, RandomState, or None | `None` | Controls the shuffling applied before splitting. Pass an int for reproducible output across multiple calls. |
| `shuffle` | bool | `True` | Whether to shuffle the data before splitting. If False, `stratify` must be None. |
| `stratify` | array-like or None | `None` | If not None, data is split in a stratified fashion using this array as class labels. |
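The size rules in the table can be checked on a small input. The sketch below is a minimal illustration, assuming a hypothetical 10-element array and an arbitrary seed; it prints the resulting subset sizes for the default, integer, and two-float cases, and shows that combining `shuffle=False` with `stratify` is rejected.

```python
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)  # hypothetical toy input with 10 samples

# Neither size given: test_size falls back to 0.25 (3 of 10 after rounding up).
train_default, test_default = train_test_split(data, random_state=0)
print(len(train_default), len(test_default))  # 7 3

# Integer test_size: absolute number of test samples.
train_int, test_int = train_test_split(data, test_size=2, random_state=0)
print(len(train_int), len(test_int))  # 8 2

# Both sizes given as floats: the subsets need not cover the whole input.
train_both, test_both = train_test_split(
    data, train_size=0.5, test_size=0.2, random_state=0
)
print(len(train_both), len(test_both))  # 5 2

# shuffle=False together with stratify is not supported and raises ValueError.
stratify_without_shuffle_rejected = False
try:
    train_test_split(data, shuffle=False, stratify=data % 2)
except ValueError:
    stratify_without_shuffle_rejected = True
print(stratify_without_shuffle_rejected)  # True
```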
Outputs
| Return | Type | Description |
|---|---|---|
| splitting | list, length = 2 * len(arrays) | A list containing the train-test split of each input array. For each input array, the train portion appears first, followed by the test portion. E.g., passing `X, y` returns `[X_train, X_test, y_train, y_test]`. |
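The output ordering generalizes to any number of inputs. As a small sketch, assuming a hypothetical third array of per-sample weights (`sample_weight` is only an illustrative name here), three inputs yield six outputs, and rows stay aligned across all of them because the same indices are applied to every array.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy inputs (hypothetical): row i of X is [2*i, 2*i + 1].
X = np.arange(16).reshape(8, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
sample_weight = np.linspace(0.1, 0.8, 8)

splits = train_test_split(X, y, sample_weight, test_size=0.25, random_state=0)
print(len(splits))  # 6

# Train/test pairs come back in the same order as the inputs.
X_train, X_test, y_train, y_test, w_train, w_test = splits

# Recover each training row's original index from its first feature value,
# then confirm that y and sample_weight were split with the same indices.
idx_train = X_train[:, 0] // 2
aligned = np.array_equal(y_train, y[idx_train]) and np.allclose(
    w_train, sample_weight[idx_train]
)
print(aligned)  # True
```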
Usage Examples
Basic split of feature matrix and target vector:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 30% of the 150 iris samples; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(X_train.shape)  # (105, 4)
print(X_test.shape)   # (45, 4)
```
Stratified split to preserve class proportions:
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```
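One way to confirm the effect of stratification is to count the labels in each subset. The sketch below repeats the stratified iris split and uses `np.bincount` as the check; the variable names are illustrative. Iris has 50 samples per class, so a 70/30 stratified split is expected to give 35 and 15 samples per class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same stratified split as in the example above; only the labels are kept.
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many samples of each class landed in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts)  # [35 35 35]
print(test_counts)   # [15 15 15]
```

Without `stratify`, the per-class counts generally vary with the seed, which matters most for small or imbalanced datasets.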
Split without shuffling:
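```python
from sklearn.model_selection import train_test_split

data = list(range(10))

# With shuffle=False the input order is kept; the default test_size of 0.25
# puts the last 3 of the 10 items in the test set.
train, test = train_test_split(data, shuffle=False)

print(train)  # [0, 1, 2, 3, 4, 5, 6]
print(test)   # [7, 8, 9]
```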