Principle:Huggingface Datasets Stratified Splitting

Knowledge Sources	Huggingface Datasets HF Datasets Docs
Domains	Data_Engineering, NLP
Last Updated	2026-02-14 18:00 GMT

Overview

Stratified Splitting implements stratified shuffle-split sampling that preserves label proportions across train and test splits, ensuring that both subsets are representative of the overall class distribution in classification tasks.

Description

When splitting a labeled dataset into training and test subsets, a naive random split can produce subsets where the class distribution differs significantly from the original dataset -- especially when some classes are rare. Stratified splitting addresses this by partitioning the data within each class independently, so that the proportion of each class in the train and test sets closely matches the proportion in the full dataset.

The implementation generates index arrays for train and test sets by first grouping all examples by their label value, shuffling within each group, and then allocating a specified fraction of each group to the test set and the remainder to the train set. This per-class splitting ensures that even minority classes are represented in both subsets. The algorithm handles edge cases where a class has very few samples by ensuring that at least one sample from each class appears in each split, preventing the situation where a rare class is entirely absent from the test set (which would make evaluation for that class impossible).

The resulting index arrays can be used to select rows from the original dataset, producing train and test subsets that maintain the statistical properties of the full dataset. This is particularly important for imbalanced classification problems, where a random split could place all examples of a rare class into one subset, leading to biased training or unreliable evaluation metrics.

Usage

Use Stratified Splitting when:

You are splitting a classification dataset and need both train and test sets to reflect the original class distribution.
Your dataset has imbalanced classes where a random split might exclude rare classes from one of the subsets.
You are evaluating a classifier and need the test set to contain examples of every class for meaningful per-class metrics.
You are using dataset.train_test_split(stratify_by_column=...) and want to understand the underlying sampling mechanism.

Theoretical Basis

Stratified sampling is a well-established technique in statistics and machine learning, rooted in the principle that a sample should be representative of the population it is drawn from. In the context of supervised learning, the "population" is the labeled dataset, and representativeness means that the class distribution in each split mirrors the overall distribution.

Formally, given a dataset with K classes where class k has n_k examples and the desired test fraction is p, stratified splitting allocates max(1, round(n_k * p)) examples from class k to the test set. The max(1, ...) constraint ensures that no class is entirely excluded from either split. This approach minimizes the variance of per-class evaluation metrics compared to unstratified random splitting, and is recommended as a default practice for classification tasks in standard machine learning references such as Hastie, Tibshirani, and Friedman's The Elements of Statistical Learning.

Related Pages

Implemented By

Implementation:Huggingface_Datasets_Stratify_Utils

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment