Implementation:Scikit learn Scikit learn KFold Init
Metadata
- Domains: Statistics, Model_Evaluation
- Source File: sklearn/model_selection/_split.py
- Last Updated: 2026-02-08 15:00 GMT
Overview
Concrete tool for creating k-fold cross-validation index splitters provided by scikit-learn. This implementation encompasses three classes -- KFold, StratifiedKFold, and GroupKFold -- each of which partitions data indices into k train-test pairs according to different splitting strategies.
API Signatures
KFold
from sklearn.model_selection import KFold
KFold(n_splits=5, *, shuffle=False, random_state=None)
Parameters:
- n_splits (int, default=5) -- Number of folds. Must be at least 2.
- shuffle (bool, default=False) -- Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.
- random_state (int, RandomState instance or None, default=None) -- When shuffle is True, controls the randomness of the index ordering. Pass an int for reproducible output across multiple function calls.
Key Methods:
- split(X, y=None, groups=None) -- Generates (train_index, test_index) pairs.
- get_n_splits(X=None, y=None, groups=None) -- Returns the number of splitting iterations.
Fold Size Distribution: The first n_samples % n_splits folds have size n_samples // n_splits + 1; the remaining folds have size n_samples // n_splits.
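The size rule above can be checked directly. The following is a minimal sketch, assuming 10 samples and 3 folds (values chosen only for illustration); the placeholder array X carries no meaningful features, since only the row count affects the split.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples, n_splits = 10, 3  # illustrative values, not from the source
X = np.zeros((n_samples, 1))  # placeholder features; only the row count matters

# Sizes predicted by the rule: the first (n_samples % n_splits) folds
# receive one extra sample.
expected_sizes = [
    n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
    for i in range(n_splits)
]

# Sizes of the test folds actually produced by KFold.
actual_sizes = [
    len(test_index) for _, test_index in KFold(n_splits=n_splits).split(X)
]

print(expected_sizes)  # [4, 3, 3]
print(actual_sizes == expected_sizes)
```

Here 10 % 3 = 1, so only the first fold receives the extra sample.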
Example:
import numpy as np
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}: Train={train_index}, Test={test_index}")
# Fold 0: Train=[2 3], Test=[0 1]
# Fold 1: Train=[0 1], Test=[2 3]
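To illustrate the shuffle and random_state parameters described earlier, here is a small sketch. The helper name collect_test_folds and the seed value 0 are hypothetical choices made for this example. It suggests that a fixed integer seed yields the same folds on repeated calls, and that every sample still lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples


def collect_test_folds(random_state):
    """Hypothetical helper: return the test indices of each shuffled fold."""
    kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
    return [test_index.tolist() for _, test_index in kf.split(X)]


first_run = collect_test_folds(0)
second_run = collect_test_folds(0)

# The same integer seed reproduces the same folds.
print(first_run == second_run)  # True

# Shuffling changes the order, not the coverage: each index is tested once.
all_tested = sorted(i for fold in first_run for i in fold)
print(all_tested == list(range(10)))  # True
```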
StratifiedKFold
from sklearn.model_selection import StratifiedKFold
StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)
Parameters:
- n_splits (int, default=5) -- Number of folds. Must be at least 2.
- shuffle (bool, default=False) -- Whether to shuffle each class's samples before splitting into batches.
- random_state (int, RandomState instance or None, default=None) -- Controls randomness when shuffle=True.
Behavior: Returns stratified folds where each fold preserves the percentage of samples for each class in y. This is a variation of KFold designed for binary or multiclass classification tasks with potentially imbalanced class distributions.
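The proportion-preserving behavior can be seen on a small imbalanced label vector. This is a sketch under assumed toy data (8 samples of class 0 and 4 of class 1, chosen so the classes divide evenly across 4 folds); real datasets rarely divide this cleanly, in which case the per-fold counts are only approximately equal.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced labels: a 2:1 ratio of class 0 to class 1.
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((len(y), 1))  # placeholder features

skf = StratifiedKFold(n_splits=4)

# Count how many samples of each class land in every test fold.
test_class_counts = [
    np.bincount(y[test_index], minlength=2).tolist()
    for _, test_index in skf.split(X, y)
]

print(test_class_counts)  # [[2, 1], [2, 1], [2, 1], [2, 1]]
```

Each test fold keeps the overall 2:1 class ratio.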
Example:
import numpy as np
from sklearn.model_selection import StratifiedKFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}: Train={train_index}, Test={test_index}")
# Fold 0: Train=[1 3], Test=[0 2]
# Fold 1: Train=[0 2], Test=[1 3]
GroupKFold
from sklearn.model_selection import GroupKFold
GroupKFold(n_splits=5, *, shuffle=False, random_state=None)
Parameters:
- n_splits (int, default=5) -- Number of folds. Must be at least 2.
- shuffle (bool, default=False) -- Whether to shuffle the groups before splitting into batches. (Added in version 1.6.)
- random_state (int, RandomState instance or None, default=None) -- Controls randomness when shuffle=True. (Added in version 1.6.)
Behavior: Each group appears in exactly one test fold across all folds. The number of distinct groups must be at least equal to n_splits. When shuffle=False, groups are distributed to folds by assigning the largest groups first to the lightest fold, producing approximately balanced folds by sample count. When shuffle=True, groups are randomly permuted before being split.
Key Difference: The split method requires groups. Although the signature is split(X, y=None, groups=None), leaving groups as None raises a ValueError.
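The group constraint can be verified with a short sketch. The toy arrays below are assumptions for illustration. The loop checks that no group label appears on both sides of a split, and the final block shows what happens when groups is omitted.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(12).reshape(6, 2)  # 6 toy samples
groups = np.array([0, 0, 2, 2, 3, 3])  # 3 distinct groups

gkf = GroupKFold(n_splits=3)

# Record, for each split, whether train and test share any group label.
disjoint_flags = []
for train_index, test_index in gkf.split(X, groups=groups):
    train_groups = set(groups[train_index].tolist())
    test_groups = set(groups[test_index].tolist())
    disjoint_flags.append(train_groups.isdisjoint(test_groups))

print(all(disjoint_flags))  # True

# split() is a generator, so the error surfaces on the first iteration.
try:
    next(gkf.split(X))
    raised = False
except ValueError:
    raised = True

print(raised)  # True
```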
Example:
import numpy as np
from sklearn.model_selection import GroupKFold
X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 2, 2, 3, 3])
gkf = GroupKFold(n_splits=2)
for i, (train_index, test_index) in enumerate(gkf.split(X, y, groups)):
    print(f"Fold {i}: Train={train_index}, Test={test_index}")
Choosing Between Variants
| Variant | When to Use | Requires |
|---|---|---|
| KFold | Regression tasks or balanced classification | X only |
| StratifiedKFold | Classification with imbalanced classes | X and y |
| GroupKFold | Correlated samples within groups (e.g., patients, sessions) | X and groups (y is optional) |
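As a rough usage sketch, each splitter can be passed as the cv argument of cross_val_score, with groups supplied only where the splitter needs it. The data, the group labels, and the choice of DummyClassifier as estimator are all assumptions made to keep the example small; the scores themselves carry no meaning.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import (
    GroupKFold,
    KFold,
    StratifiedKFold,
    cross_val_score,
)

# Toy data: 12 samples, imbalanced labels, 4 hypothetical groups of 3 samples.
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)
groups = np.repeat(np.arange(4), 3)

# Placeholder estimator; any scikit-learn classifier could be used instead.
model = DummyClassifier(strategy="most_frequent")

scores = {
    # KFold needs only X to split (y is used here for scoring).
    "KFold": cross_val_score(model, X, y, cv=KFold(n_splits=4)),
    # StratifiedKFold uses y to balance the class proportions per fold.
    "StratifiedKFold": cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=4)),
    # GroupKFold additionally needs the group label of every sample.
    "GroupKFold": cross_val_score(
        model, X, y, groups=groups, cv=GroupKFold(n_splits=4)
    ),
}

for name, fold_scores in scores.items():
    print(name, len(fold_scores))
```

Each entry holds one score per fold, so every array has length 4 here.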