Principle: Recommenders Stratified Data Splitting
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Data Splitting, Evaluation Methodology |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Stratified data splitting is a technique that partitions user-item interaction data into training and test sets while preserving the distribution of ratings per user (or per item), preventing data leakage and ensuring fair evaluation of recommender systems.
Description
In recommender system evaluation, naively splitting data at random can produce train/test sets where some users have all their interactions in one split and none in the other. This leads to two problems:
- Cold-start contamination: Users with no training data cannot be modeled, artificially deflating performance metrics.
- Distribution mismatch: The ratio of interactions per user in the test set may not reflect the training set, making evaluation unreliable.
Stratified splitting solves these problems by partitioning data within each user (or item) group. For each user, a specified proportion of their ratings is placed in the training set and the remainder in the test set. This guarantees that every user who survives filtering has ratings in both splits, and that per-user proportions are maintained.
An optional minimum-rating filter removes users (or items) who have too few interactions to meaningfully split, further ensuring data quality.
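The minimum-rating filter described above can be sketched in pandas. This is an illustrative implementation, not a specific library's API; the column names `userID` and `itemID` and the function name are assumptions for the example.

```python
import pandas as pd

def filter_min_rating(df, min_rating=3, filter_by="user",
                      col_user="userID", col_item="itemID"):
    """Keep only rows whose user (or item) has at least min_rating interactions."""
    key = col_user if filter_by == "user" else col_item
    # Map each row's key to that key's total interaction count, then threshold.
    counts = df[key].map(df[key].value_counts())
    return df[counts >= min_rating]

ratings = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 3],
    "itemID": [10, 11, 12, 10, 13, 14],
    "rating": [4, 5, 3, 2, 5, 1],
})
# User 3 has only one interaction and is dropped.
filtered = filter_min_rating(ratings, min_rating=2)
```

With `min_rating=2`, users 1 and 2 are retained (3 and 2 interactions respectively) while user 3 is removed before any split takes place.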
Usage
Use stratified splitting when:
- You are evaluating a collaborative filtering model and need every user to appear in both train and test sets.
- You want to control the exact ratio of train-to-test interactions on a per-user or per-item basis.
- You need reproducible splits with a fixed random seed.
- You want to filter out users or items with insufficient interaction history before splitting.
Theoretical Basis
Given a dataset D of user-item interaction records (u, i, r, t) and a split ratio in (0, 1), stratified splitting operates as follows:
function stratified_split(D, ratio, min_rating, filter_by, seed):
    # Step 1: Filter entities with insufficient interactions
    if filter_by == "user":
        D = {(u, i, r, t) in D : count(u in D) >= min_rating}
    else:
        D = {(u, i, r, t) in D : count(i in D) >= min_rating}

    # Step 2: Group by the stratification entity
    groups = group_by(D, key=filter_by)

    # Step 3: For each group, perform a random split preserving the ratio
    train, test = empty, empty
    for group in groups:
        shuffled = random_shuffle(group, seed)
        split_point = floor(len(group) * ratio)
        train += shuffled[:split_point]
        test += shuffled[split_point:]
    return train, test
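The per-group shuffle-and-split loop above translates directly to pandas and NumPy. The sketch below is a minimal self-contained version, assuming the filter step has already been applied; the column name `userID` and the function signature are illustrative.

```python
import numpy as np
import pandas as pd

def stratified_split(df, ratio=0.75, col_user="userID", seed=42):
    """Per-user split: for each user, a ratio fraction of rows goes to train."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, group in df.groupby(col_user):
        # Shuffle positions within this user's rows, then cut at floor(n * ratio).
        idx = rng.permutation(len(group))
        split_point = int(np.floor(len(group) * ratio))
        train_parts.append(group.iloc[idx[:split_point]])
        test_parts.append(group.iloc[idx[split_point:]])
    return pd.concat(train_parts), pd.concat(test_parts)

ratings = pd.DataFrame({
    "userID": [1, 1, 1, 1, 2, 2, 2, 2],
    "itemID": [10, 11, 12, 13, 10, 11, 14, 15],
})
train, test = stratified_split(ratings, ratio=0.75)
```

With four ratings per user and ratio 0.75, each user contributes three rows to train and one to test; reusing the same seed reproduces the split exactly, matching the determinism property listed below.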
Key properties of stratified splitting:
- Per-user balance: If a user has n ratings, approximately floor(ratio * n) go to training and the rest to testing.
- No data leakage: Each rating appears in exactly one split.
- Determinism: A fixed seed produces identical splits across runs.
- Multi-split support: The ratio can be a list (e.g., [0.6, 0.2, 0.2]) to produce more than two splits (train/validation/test).
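The multi-split case can be sketched by cutting each user's shuffled rows at cumulative split points. This is an illustrative generalization of the two-way sketch, not a particular library's API; it allocates floor(ratio_k * n) rows to each split and any remainder to the last split.

```python
import numpy as np
import pandas as pd

def stratified_multi_split(df, ratios=(0.6, 0.2, 0.2), col_user="userID", seed=42):
    """Per-user split into len(ratios) parts, e.g. train/validation/test."""
    rng = np.random.default_rng(seed)
    splits = [[] for _ in ratios]
    for _, group in df.groupby(col_user):
        idx = rng.permutation(len(group))
        # floor(ratio_k * n) rows per split; leftover rows land in the last split.
        sizes = np.floor(np.asarray(ratios) * len(group)).astype(int)
        cut_points = np.cumsum(sizes)[:-1]
        for split, chunk in zip(splits, np.split(idx, cut_points)):
            split.append(group.iloc[chunk])
    return [pd.concat(s) for s in splits]

ratings = pd.DataFrame({
    "userID": np.repeat([1, 2], 10),
    "itemID": np.tile(np.arange(10), 2),
})
train, valid, test = stratified_multi_split(ratings, ratios=(0.6, 0.2, 0.2))
```

With ten ratings per user and ratios (0.6, 0.2, 0.2), each user contributes six rows to train and two each to validation and test, preserving the per-user proportions across all three splits.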
The technique is analogous to stratified k-fold cross-validation in classification, where class proportions are preserved across folds. Here, the "class" is the user (or item) identity.