Principle: Recommenders Stratified Data Splitting
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Data Splitting, Evaluation Methodology |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Stratified data splitting is a technique that partitions user-item interaction data into training and test sets while preserving the distribution of ratings per user (or per item), preventing data leakage and ensuring fair evaluation of recommender systems.
Description
In recommender system evaluation, naively splitting data at random can produce train/test sets where some users have all their interactions in one split and none in the other. This leads to two problems:
- Cold-start contamination: Users with no training data cannot be modeled, artificially deflating performance metrics.
- Distribution mismatch: The ratio of interactions per user in the test set may not reflect the training set, making evaluation unreliable.
Stratified splitting solves these problems by partitioning data within each user (or item) group. For each user, a specified proportion of their ratings is placed in the training set and the remainder in the test set. This guarantees that every user who survives filtering has ratings in both splits, and that per-user proportions are maintained.
An optional minimum-rating filter removes users (or items) who have too few interactions to meaningfully split, further ensuring data quality.
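The minimum-rating filter described above can be sketched in pandas. This is an illustrative implementation, not a specific library's API; the column names `userID` and `itemID` and the function name are assumptions for the example.

```python
import pandas as pd

def filter_min_rating(df, min_rating=3, filter_by="user",
                      col_user="userID", col_item="itemID"):
    """Keep only rows whose user (or item) has at least min_rating interactions."""
    key = col_user if filter_by == "user" else col_item
    # Map each row's key to that key's total interaction count, then threshold.
    counts = df[key].map(df[key].value_counts())
    return df[counts >= min_rating]

ratings = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "itemID": [10, 11, 12, 10, 13, 14],
    "rating": [4, 5, 3, 2, 5, 1],
})
# User 3 has only one interaction and is dropped.
filtered = filter_min_rating(ratings, min_rating=2)
```

With `min_rating=2`, users 1 and 2 are retained (3 and 2 interactions respectively) while user 3 is removed before any split takes place.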
Usage
Use stratified splitting when:
- You are evaluating a collaborative filtering model and need every user to appear in both train and test sets.
- You want to control the exact ratio of train-to-test interactions on a per-user or per-item basis.
- You need reproducible splits with a fixed random seed.
- You want to filter out users or items with insufficient interaction history before splitting.
Theoretical Basis
Given a dataset D of user-item interaction records (u, i, r, t) and a split ratio in (0, 1), stratified splitting operates as follows:
function stratified_split(D, ratio, min_rating, filter_by, seed):
    # Step 1: Filter entities with insufficient interactions
    if filter_by == "user":
        D = {(u, i, r, t) in D : count(u in D) >= min_rating}
    else:
        D = {(u, i, r, t) in D : count(i in D) >= min_rating}

    # Step 2: Group by the stratification entity
    groups = group_by(D, key=filter_by)

    # Step 3: For each group, perform a random split preserving the ratio
    train, test = empty, empty
    for group in groups:
        shuffled = random_shuffle(group, seed)
        split_point = floor(len(group) * ratio)
        train += shuffled[:split_point]
        test += shuffled[split_point:]
    return train, test
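The per-group shuffle-and-split loop above translates directly to pandas and NumPy. The sketch below is a minimal self-contained version, assuming the filter step has already been applied; the column name `userID` and the function signature are illustrative.

```python
import numpy as np
import pandas as pd

def stratified_split(df, ratio=0.75, col_user="userID", seed=42):
    """Per-user split: for each user, a ratio fraction of rows goes to train."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, group in df.groupby(col_user):
        # Shuffle positions within this user's rows, then cut at floor(n * ratio).
        idx = rng.permutation(len(group))
        split_point = int(np.floor(len(group) * ratio))
        train_parts.append(group.iloc[idx[:split_point]])
        test_parts.append(group.iloc[idx[split_point:]])
    return pd.concat(train_parts), pd.concat(test_parts)

ratings = pd.DataFrame({
    "userID": [1, 1, 1, 1, 2, 2, 2, 2],
    "itemID": [10, 11, 12, 13, 10, 11, 14, 15],
})
train, test = stratified_split(ratings, ratio=0.75)
```

With four ratings per user and ratio 0.75, each user contributes three rows to train and one to test; reusing the same seed reproduces the split exactly, matching the determinism property listed below.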
Key properties of stratified splitting:
- Per-user balance: If a user has n ratings, approximately floor(ratio * n) go to training and the rest to testing.
- No data leakage: Each rating appears in exactly one split.
- Determinism: A fixed seed produces identical splits across runs.
- Multi-split support: The ratio can be a list (e.g., [0.6, 0.2, 0.2]) to produce more than two splits (train/validation/test).
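The multi-split case can be sketched by cutting each user's shuffled rows at cumulative split points. This is an illustrative generalization of the two-way sketch, not a particular library's API; it allocates floor(ratio_k * n) rows to each split and any remainder to the last split.

```python
import numpy as np
import pandas as pd

def stratified_multi_split(df, ratios=(0.6, 0.2, 0.2), col_user="userID", seed=42):
    """Per-user split into len(ratios) parts, e.g. train/validation/test."""
    rng = np.random.default_rng(seed)
    splits = [[] for _ in ratios]
    for _, group in df.groupby(col_user):
        idx = rng.permutation(len(group))
        # floor(ratio_k * n) rows per split; leftover rows land in the last split.
        sizes = np.floor(np.asarray(ratios) * len(group)).astype(int)
        cut_points = np.cumsum(sizes)[:-1]
        for split, chunk in zip(splits, np.split(idx, cut_points)):
            split.append(group.iloc[chunk])
    return [pd.concat(s) for s in splits]

ratings = pd.DataFrame({
    "userID": np.repeat([1, 2], 10),
    "itemID": np.tile(np.arange(10), 2),
})
train, valid, test = stratified_multi_split(ratings, ratios=(0.6, 0.2, 0.2))
```

With ten ratings per user and ratios (0.6, 0.2, 0.2), each user contributes six rows to train and two each to validation and test, preserving the per-user proportions across all three splits.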
The technique is analogous to stratified k-fold cross-validation in classification, where class proportions are preserved across folds. Here, the "class" is the user (or item) identity.