
Principle: Recommenders Chronological Data Splitting

From Leeroopedia


Knowledge Sources
Domains Recommender Systems, Data Splitting, Temporal Evaluation
Last Updated 2026-02-10 00:00 GMT

Overview

Time-based data splitting preserves temporal ordering so that for each user, earlier interactions go to the training set and later interactions go to the test set, thereby avoiding temporal data leakage.

Description

Chronological data splitting is a stratified splitting strategy that respects the natural time order of user-item interactions. Rather than randomly partitioning data (which risks placing future interactions in the training set and past interactions in the test set), chronological splitting sorts each user's interactions by timestamp and assigns the earliest proportion to training and the remainder to testing.

This is critical in recommender system evaluation because recommendations are fundamentally a prediction of future behavior. If a model is trained on interactions that occurred after the interactions it is being evaluated on, the evaluation is compromised by temporal data leakage -- the model has effectively seen the future. This inflates offline metrics and gives a misleading picture of real-world performance.

The split is stratified by user (or optionally by item), meaning the specified ratio is applied independently within each user's interaction history. This ensures that every user with sufficient interactions is represented in both training and test sets. A min_rating threshold can exclude users (or items) with too few interactions to produce meaningful splits.

When a list of ratios is provided (e.g., [0.6, 0.2, 0.2]), the function produces multiple splits (train/validation/test), each respecting temporal order.

Usage

Use chronological splitting whenever your dataset includes timestamps and you want to simulate a realistic temporal evaluation scenario. This is the recommended splitting strategy for sequential recommendation, session-based recommendation, and any setting where the order of interactions matters. Prefer this over random splitting for implicit feedback datasets (clicks, views, purchases) where temporal patterns are strong.

Theoretical Basis

Temporal Split Procedure

For each user u with interactions sorted by timestamp:

interactions_u = sort_by_timestamp(all_interactions_of_user_u)
n_u = len(interactions_u)
split_point = floor(ratio * n_u)

train_u = interactions_u[0 : split_point]
test_u  = interactions_u[split_point : n_u]

The global training and test sets are the union of per-user splits:

train = union(train_u for all u)
test  = union(test_u for all u)
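The procedure above can be sketched in Python with pandas. This is a minimal illustration, not a specific library's API; the function name `chrono_split` and the column names (`userID`, `itemID`, `timestamp`) are assumptions for the example.

```python
import pandas as pd

def chrono_split(data, ratio=0.75, col_user="userID", col_timestamp="timestamp"):
    """Per-user chronological split: earliest interactions to train, latest to test."""
    # Sort so that within each user, interactions appear in time order.
    data = data.sort_values([col_user, col_timestamp])
    train_parts, test_parts = [], []
    for _, group in data.groupby(col_user):
        split_point = int(ratio * len(group))  # floor(ratio * n_u)
        train_parts.append(group.iloc[:split_point])
        test_parts.append(group.iloc[split_point:])
    # Global train/test sets are the union of the per-user splits.
    train = pd.concat(train_parts, ignore_index=True)
    test = pd.concat(test_parts, ignore_index=True)
    return train, test

# Toy interaction log: two users, four interactions each, timestamps increasing.
df = pd.DataFrame({
    "userID":    [1, 1, 1, 1, 2, 2, 2, 2],
    "itemID":    [10, 11, 12, 13, 20, 21, 22, 23],
    "timestamp": [1, 2, 3, 4, 1, 2, 3, 4],
})
train, test = chrono_split(df, ratio=0.75)
# With ratio=0.75, each user's first 3 interactions go to train, the last 1 to test.
```

Note that the split point is computed per user, so the overall train/test proportion matches the ratio only approximately when users have few interactions.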

Why Not Random Splitting?

Random splitting treats interactions as i.i.d. samples, ignoring temporal structure. Consider a user who watched Action movies in January and Comedies in February. A random split might place a February Comedy in the training set and a January Action movie in the test set, allowing the model to "learn" from future preferences. Chronological splitting prevents this by enforcing max(timestamp_train_u) <= min(timestamp_test_u) for every user.
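The contrast can be made concrete with a toy history. This sketch (with invented data) shows that sorting by timestamp before cutting guarantees the constraint by construction, while shuffling does not:

```python
import random

# One user's history: (timestamp, genre) pairs; January = Action, February = Comedy.
history = [(1, "Action"), (2, "Action"), (3, "Comedy"), (4, "Comedy")]

# Random split: shuffle, then cut -- temporal order is not guaranteed,
# so a late Comedy may land in train while an early Action lands in test.
random.seed(0)
shuffled = random.sample(history, len(history))
rand_train, rand_test = shuffled[:2], shuffled[2:]

# Chronological split: sort by timestamp, then cut -- the constraint
# max(train timestamps) <= min(test timestamps) holds by construction.
ordered = sorted(history)
chrono_train, chrono_test = ordered[:2], ordered[2:]
chrono_ok = max(t for t, _ in chrono_train) <= min(t for t, _ in chrono_test)
```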

Multi-Way Splits

When the ratio is a list [r1, r2, ..., rk], the function divides each user's sorted interactions into k consecutive segments. The ratios are normalized to sum to 1 if they do not already. This supports train/validation/test workflows common in hyperparameter tuning.
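A multi-way variant can be sketched as follows; again a minimal illustration under the same assumed column names, not a particular library's implementation:

```python
import pandas as pd

def multi_chrono_split(data, ratios, col_user="userID", col_timestamp="timestamp"):
    """Split each user's sorted interactions into k consecutive chronological segments."""
    total = sum(ratios)
    ratios = [r / total for r in ratios]  # normalize so the ratios sum to 1
    # Cumulative interior boundaries, e.g. [0.6, 0.2, 0.2] -> [0.6, 0.8]
    cum, acc = [], 0.0
    for r in ratios[:-1]:
        acc += r
        cum.append(acc)
    data = data.sort_values([col_user, col_timestamp])
    parts = [[] for _ in ratios]
    for _, group in data.groupby(col_user):
        n = len(group)
        bounds = [0] + [int(c * n) for c in cum] + [n]
        for i in range(len(ratios)):
            parts[i].append(group.iloc[bounds[i]:bounds[i + 1]])
    return [pd.concat(p, ignore_index=True) for p in parts]

# One user with 10 interactions; ratios [3, 1, 1] normalize to [0.6, 0.2, 0.2].
df = pd.DataFrame({
    "userID":    [1] * 10,
    "itemID":    list(range(10)),
    "timestamp": list(range(10)),
})
train, valid, test = multi_chrono_split(df, ratios=[3, 1, 1])
```

Each segment still respects temporal order: every validation timestamp follows every training timestamp, and every test timestamp follows every validation timestamp.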

Related Pages

Implemented By
