
Implementation:Scikit-learn KFold Init

From Leeroopedia


Metadata

  • Domains: Statistics, Model_Evaluation
  • Source File: sklearn/model_selection/_split.py
  • Last Updated: 2026-02-08 15:00 GMT

Overview

Scikit-learn's concrete tools for creating k-fold cross-validation index splitters. This page covers three classes -- KFold, StratifiedKFold, and GroupKFold -- each of which partitions data indices into k train/test pairs according to a different splitting strategy.

API Signatures

KFold

from sklearn.model_selection import KFold

KFold(n_splits=5, *, shuffle=False, random_state=None)

Parameters:

  • n_splits (int, default=5) -- Number of folds. Must be at least 2.
  • shuffle (bool, default=False) -- Whether to shuffle the data before splitting into batches. Note that the samples within each split will not be shuffled.
  • random_state (int, RandomState instance or None, default=None) -- When shuffle is True, controls the randomness of the index ordering. Pass an int for reproducible output across multiple function calls.

Key Methods:

  • split(X, y=None, groups=None) -- Generates (train_index, test_index) pairs.
  • get_n_splits(X=None, y=None, groups=None) -- Returns the number of splitting iterations.

Fold Size Distribution: The first n_samples % n_splits folds have size n_samples // n_splits + 1; the remaining folds have size n_samples // n_splits.
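
The fold-size formula above can be checked directly; a minimal sketch (the sample count is chosen here purely for illustration):

```python
import numpy as np
from sklearn.model_selection import KFold

# 7 samples into 3 folds: 7 % 3 = 1 fold of size 7 // 3 + 1 = 3,
# and the remaining 2 folds of size 7 // 3 = 2.
X = np.arange(7).reshape(-1, 1)
kf = KFold(n_splits=3)
sizes = [len(test_index) for _, test_index in kf.split(X)]
print(sizes)  # [3, 2, 2]
```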

Example:

import numpy as np
from sklearn.model_selection import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=2)

for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold {i}: Train={train_index}, Test={test_index}")
# Fold 0: Train=[2 3], Test=[0 1]
# Fold 1: Train=[0 1], Test=[2 3]
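
When shuffle=True, passing an integer random_state makes the shuffled splits reproducible across separately constructed splitters; a short sketch with made-up data:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(6).reshape(-1, 1)

# Two independently constructed splitters with the same random_state
# produce identical folds.
splits_a = list(KFold(n_splits=3, shuffle=True, random_state=42).split(X))
splits_b = list(KFold(n_splits=3, shuffle=True, random_state=42).split(X))

for (train_a, test_a), (train_b, test_b) in zip(splits_a, splits_b):
    assert np.array_equal(test_a, test_b)
    assert np.array_equal(train_a, train_b)
```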

StratifiedKFold

from sklearn.model_selection import StratifiedKFold

StratifiedKFold(n_splits=5, *, shuffle=False, random_state=None)

Parameters:

  • n_splits (int, default=5) -- Number of folds. Must be at least 2.
  • shuffle (bool, default=False) -- Whether to shuffle each class's samples before splitting into batches.
  • random_state (int, RandomState instance or None, default=None) -- Controls randomness when shuffle=True.

Behavior: Returns stratified folds where each fold preserves the percentage of samples for each class in y. This is a variation of KFold designed for binary or multiclass classification tasks with potentially imbalanced class distributions.

Example:

import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)

for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}: Train={train_index}, Test={test_index}")
# Fold 0: Train=[1 3], Test=[0 2]
# Fold 1: Train=[0 2], Test=[1 3]
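
The stratification guarantee is easiest to see on imbalanced data; in this sketch the class counts are chosen so the 2:1 ratio divides evenly across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced labels: 8 samples of class 0, 4 of class 1 (2:1 ratio).
X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)
for _, test_index in skf.split(X, y):
    # Each test fold preserves the 2:1 ratio: 2 of class 0, 1 of class 1.
    print(np.bincount(y[test_index]))  # [2 1] in every fold
```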

GroupKFold

from sklearn.model_selection import GroupKFold

GroupKFold(n_splits=5, *, shuffle=False, random_state=None)

Parameters:

  • n_splits (int, default=5) -- Number of folds. Must be at least 2.
  • shuffle (bool, default=False) -- Whether to shuffle the groups before splitting into batches. (Added in version 1.6.)
  • random_state (int, RandomState instance or None, default=None) -- Controls randomness when shuffle=True. (Added in version 1.6.)

Behavior: Each group appears in exactly one test fold across all folds. The number of distinct groups must be at least equal to n_splits. When shuffle=False, groups are distributed to folds by assigning the largest groups first to the lightest fold, producing approximately balanced folds by sample count. When shuffle=True, groups are randomly permuted before being split.

Key Difference: Although the signature reads split(X, y=None, groups=None), GroupKFold requires groups to be provided; calling split without it raises a ValueError.

Example:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 2, 2, 3, 3])
gkf = GroupKFold(n_splits=2)

for i, (train_index, test_index) in enumerate(gkf.split(X, y, groups)):
    print(f"Fold {i}: Train={train_index}, Test={test_index}")
# Fold 0: Train=[2 3], Test=[0 1 4 5]
# Fold 1: Train=[0 1 4 5], Test=[2 3]
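
The group-exclusivity guarantee described above -- no group ever appears on both sides of a split -- can be asserted directly:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.arange(6).reshape(-1, 1)
groups = np.array([0, 0, 2, 2, 3, 3])

gkf = GroupKFold(n_splits=2)
for train_index, test_index in gkf.split(X, groups=groups):
    # The train and test sides never share a group label.
    assert set(groups[train_index]).isdisjoint(groups[test_index])
```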

Choosing Between Variants

  • KFold -- Regression tasks or balanced classification. Requires X only.
  • StratifiedKFold -- Classification with imbalanced classes. Requires X and y.
  • GroupKFold -- Correlated samples within groups (e.g., patients, sessions). Requires X, y, and groups.
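
All three splitters also plug into scikit-learn's higher-level utilities such as cross_val_score via the cv parameter (not shown on this page). A minimal sketch, using DummyClassifier purely as a stand-in estimator:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.zeros((12, 1))
y = np.array([0] * 8 + [1] * 4)

# Any splitter instance can be passed as cv=; scoring defaults to the
# estimator's score method (accuracy for classifiers).
scores = cross_val_score(
    DummyClassifier(strategy="most_frequent"), X, y,
    cv=StratifiedKFold(n_splits=4),
)
print(scores)  # one accuracy per fold
```

For GroupKFold, pass the group labels via the groups keyword of cross_val_score as well.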
