
Principle: Stratified Data Splitting (Recommenders)

From Leeroopedia


Knowledge Sources
Domains Recommender Systems, Data Splitting, Evaluation Methodology
Last Updated 2026-02-10 00:00 GMT

Overview

Stratified data splitting is a technique that partitions user-item interaction data into training and test sets while preserving the distribution of ratings per user (or per item), preventing data leakage and ensuring fair evaluation of recommender systems.

Description

In recommender system evaluation, naively splitting data at random can produce train/test sets where some users have all their interactions in one split and none in the other. This leads to two problems:

  1. Cold-start contamination: Users with no training data cannot be modeled, artificially deflating performance metrics.
  2. Distribution mismatch: The ratio of interactions per user in the test set may not reflect the training set, making evaluation unreliable.

Stratified splitting solves these problems by partitioning data within each user (or item) group. For each user, a specified proportion of their ratings is placed in the training set and the remainder in the test set. This ensures that every user appears in both splits (provided each user retains at least two ratings after filtering) and that per-user proportions are approximately maintained.

An optional minimum-rating filter removes users (or items) who have too few interactions to meaningfully split, further ensuring data quality.
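
As a concrete illustration, the minimum-interaction filter on its own can be sketched in plain Python. This is a minimal sketch, not code from any particular library; the function name and the assumption that rows are (user, item, rating) tuples are illustrative.

```python
from collections import Counter

def min_rating_filter(rows, min_rating, filter_by="user"):
    # rows: (user, item, rating) tuples. Keep only rows whose user
    # (or item, when filter_by == "item") appears at least
    # min_rating times in the dataset.
    key = 0 if filter_by == "user" else 1
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] >= min_rating]

rows = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0)]
kept = min_rating_filter(rows, min_rating=2)  # drops u2, who has only one rating
```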

Usage

Use stratified splitting when:

  • You are evaluating a collaborative filtering model and need every user to appear in both train and test sets.
  • You want to control the exact ratio of train-to-test interactions on a per-user or per-item basis.
  • You need reproducible splits with a fixed random seed.
  • You want to filter out users or items with insufficient interaction history before splitting.

Theoretical Basis

Given a dataset D of user-item interactions and a split ratio r ∈ (0, 1), stratified splitting operates as follows:

function stratified_split(D, ratio, min_rating, filter_by, seed):
    # Step 1: Filter entities with insufficient interactions
    if filter_by == "user":
        D = { rows of D whose user u appears >= min_rating times in D }
    else:
        D = { rows of D whose item i appears >= min_rating times in D }

    # Step 2: Group by the stratification entity
    groups = group_by(D, key=filter_by)

    # Step 3: For each group, perform a seeded random split preserving the ratio
    rng = new_rng(seed)    # one generator, so each group gets a distinct shuffle
    train, test = empty, empty
    for group in groups:
        shuffled = shuffle(group, rng)
        split_point = floor(len(group) * ratio)
        train += shuffled[:split_point]
        test  += shuffled[split_point:]

    return train, test
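
The pseudocode above can be turned into a runnable sketch in plain Python. This is a minimal illustration, not a reference implementation from any particular library; it assumes rows are (user, item, rating) tuples and keeps the page's parameter names.

```python
import random
from collections import defaultdict

def stratified_split(rows, ratio, min_rating=1, filter_by="user", seed=42):
    # rows: (user, item, rating) tuples
    key = 0 if filter_by == "user" else 1

    # Step 1: drop entities with fewer than min_rating interactions
    counts = defaultdict(int)
    for row in rows:
        counts[row[key]] += 1
    rows = [row for row in rows if counts[row[key]] >= min_rating]

    # Step 2: group by user (or item)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    # Step 3: per-group seeded shuffle, then split at the ratio boundary
    rng = random.Random(seed)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        cut = int(len(group) * ratio)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

With ratio = 0.75 and min_rating = 2, every surviving user contributes to both sets; in general, guaranteeing at least one rating per side would additionally require clamping the cut point away from 0 and len(group), which the floor-based split above does not do.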

Key properties of stratified splitting:

  • Per-user balance: If a user has n ratings, approximately n×r go to training and the rest to testing.
  • No data leakage: Each rating appears in exactly one split.
  • Determinism: A fixed seed produces identical splits across runs.
  • Multi-split support: The ratio can be a list (e.g., [0.6, 0.2, 0.2]) to produce more than two splits (train/validation/test).
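
The multi-split property can be sketched by extending the per-group split to a list of ratios. This is an illustrative sketch assuming the ratios sum to 1 and rows are (user, item, rating) tuples; the function name is hypothetical.

```python
import random
from collections import defaultdict

def multi_split(rows, ratios, filter_by="user", seed=42):
    # Split each user's (or item's) rows into len(ratios) parts,
    # e.g. ratios=[0.6, 0.2, 0.2] -> train/validation/test.
    key = 0 if filter_by == "user" else 1
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    rng = random.Random(seed)
    splits = [[] for _ in ratios]
    for group in groups.values():
        rng.shuffle(group)
        start = 0
        for k, r in enumerate(ratios):
            # the last split takes whatever remains, so no row is dropped
            end = len(group) if k == len(ratios) - 1 else start + int(len(group) * r)
            splits[k].extend(group[start:end])
            start = end
    return splits
```

Giving the remainder to the last split keeps the union of splits equal to the input, at the cost of the final split absorbing any rounding slack.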

The technique is analogous to stratified k-fold cross-validation in classification, where class proportions are preserved across folds. Here, the "class" is the user (or item) identity.

Related Pages

Implemented By
