Implementation:Recommenders team Recommenders Python Stratified Split

Knowledge Sources
Domains Recommender Systems, Data Splitting, Evaluation Methodology
Last Updated 2026-02-10 00:00 GMT

Overview

A concrete tool, provided by the recommenders library, for stratified train/test splitting of user-item interaction data.

Description

The python_stratified_split function splits a pandas DataFrame of user-item interactions into training and test sets so that each user's (or item's) interactions are divided among the splits in approximately the requested proportions. It delegates to an internal stratification routine that groups the data by the chosen entity (user or item), drops entities with fewer interactions than min_rating, and splits the rows of each group at random according to the ratio. The function supports both two-way splits (a single float ratio) and multi-way splits (a list of float ratios).
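The per-group proportional split described above can be illustrated with a minimal sketch in plain pandas and NumPy. This is not the library's implementation: the function name stratified_split_sketch, the column name "userID", and the rank-based thresholding are assumptions chosen for illustration, and the min_rating filter is left out.

```python
import numpy as np
import pandas as pd


def stratified_split_sketch(data, ratios, group_col="userID", seed=42):
    """Illustrative only: split each group's rows according to `ratios`."""
    rng = np.random.default_rng(seed)
    # Shuffle all rows once, then rank the rows within each group.
    shuffled = data.iloc[rng.permutation(len(data))]
    groups = shuffled.groupby(group_col)[group_col]
    rank = groups.cumcount() + 1        # 1-based position inside the group
    count = groups.transform("count")   # group size, repeated on every row

    splits, lower = [], 0.0
    for upper in np.cumsum(ratios):
        # Keep rows whose rank falls in (lower * count, upper * count].
        mask = (rank > np.round(lower * count)) & (rank <= np.round(upper * count))
        splits.append(shuffled[mask])
        lower = upper
    return splits


# Toy data: user 1 has 4 interactions, user 2 has 8.
interactions = pd.DataFrame(
    {
        "userID": [1] * 4 + [2] * 8,
        "itemID": list(range(4)) + list(range(8)),
    }
)
train, test = stratified_split_sketch(interactions, ratios=[0.75, 0.25])
print(train.groupby("userID").size().to_dict())
print(test.groupby("userID").size().to_dict())
```

With this toy data, each user contributes three quarters of their rows to the first split and the remainder to the second, which is the property a stratified split is meant to provide.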

Usage

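Import and call this function after loading your dataset and before model training. It creates reproducible (seeded), stratified train/test splits in which each user (or item) with enough interactions is represented in every split. An entity with only a single interaction can appear in just one split, so coverage is not guaranteed for very sparse users or items.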
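Whether a given user actually appears in each split depends on how many interactions that user has, so it can be worth checking coverage after splitting. The following is a minimal sketch; the helper entities_missing_from and the "userID" column name are hypothetical and not part of the recommenders library.

```python
import pandas as pd


def entities_missing_from(reference, split, col="userID"):
    """Return IDs present in `reference` but absent from `split` (hypothetical helper)."""
    return set(reference[col]) - set(split[col])


# Toy splits: user 3 has a single interaction, so it can only land in one split.
train = pd.DataFrame({"userID": [1, 1, 2, 2, 3], "itemID": [10, 11, 10, 12, 13]})
test = pd.DataFrame({"userID": [1, 2], "itemID": [12, 11]})

users_only_in_train = entities_missing_from(train, test)
print(users_only_in_train)
```

An empty result means every user in the training set also has at least one test interaction; raising min_rating is one way to reduce the number of users reported here.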

Code Reference

Source Location

  • Repository: recommenders
  • File: recommenders/datasets/python_splitters.py
  • Lines: L161-L201

Signature

def python_stratified_split(
    data,
    ratio=0.75,
    min_rating=1,
    filter_by="user",
    col_user=DEFAULT_USER_COL,
    col_item=DEFAULT_ITEM_COL,
    seed=42,
) -> list[pd.DataFrame]

Import

from recommenders.datasets.python_splitters import python_stratified_split

I/O Contract

Inputs

Name | Type | Required | Description
data | pd.DataFrame | Yes | User-item interaction DataFrame to be split.
ratio | float or list of float | No (default: 0.75) | Split ratio. A single float produces a two-way split (train/test). A list of floats produces multiple splits. Ratios are normalized to sum to 1 if they do not already.
min_rating | int | No (default: 1) | Minimum number of ratings a user or item must have to be included in the split. Entities below this threshold are filtered out.
filter_by | str | No (default: "user") | Entity to stratify and filter by. Either "user" or "item".
col_user | str | No (default: DEFAULT_USER_COL) | Column name for user IDs.
col_item | str | No (default: DEFAULT_ITEM_COL) | Column name for item IDs.
seed | int | No (default: 42) | Random seed for reproducible splits.
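The documented handling of the ratio argument can be sketched as follows. The helper normalize_ratios is hypothetical, and its validation rules (a single float strictly between 0 and 1, all list entries positive) are assumptions made for this illustration rather than a description of the library's checks.

```python
def normalize_ratios(ratio):
    """Hypothetical helper mirroring the documented behaviour of `ratio`."""
    if isinstance(ratio, float):
        # Assumed validation: a single ratio must be a proper fraction.
        if not 0 < ratio < 1:
            raise ValueError("A single ratio must lie strictly between 0 and 1.")
        # A single float describes a two-way split: [train, test].
        return [ratio, 1 - ratio]
    # Assumed validation: every entry of a ratio list must be positive.
    if any(r <= 0 for r in ratio):
        raise ValueError("All ratios must be positive.")
    # A list is rescaled so that its entries sum to 1.
    total = sum(ratio)
    return [r / total for r in ratio]


print(normalize_ratios(0.75))             # [0.75, 0.25]
print(normalize_ratios([2.0, 1.0, 1.0]))  # [0.5, 0.25, 0.25]
```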

Outputs

Name | Type | Description
return | list[pd.DataFrame] | List of DataFrames corresponding to each split. For a single float ratio, returns a list of two DataFrames [train, test]. For a list of ratios, returns one DataFrame per ratio element.

Usage Examples

Basic Usage

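from recommenders.datasets.python_splitters import python_stratified_split

# `data` is assumed to be a pandas DataFrame of interactions whose user and
# item ID columns match col_user and col_item.

# Two-way 75/25 stratified split by user
train, test = python_stratified_split(data, ratio=0.75, seed=42)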

# Three-way split (train/val/test) with 60/20/20 ratio
train, val, test = python_stratified_split(data, ratio=[0.6, 0.2, 0.2])

# Stratify by item instead of user
train, test = python_stratified_split(data, ratio=0.75, filter_by="item")

# Filter out users with fewer than 5 ratings
train, test = python_stratified_split(data, ratio=0.75, min_rating=5)
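
To make the effect of min_rating concrete, the filtering step can be approximated in plain pandas. This is a sketch under stated assumptions, not the library's code: filter_min_interactions is a hypothetical name, and "userID" is an assumed column name.

```python
import pandas as pd


def filter_min_interactions(data, min_rating=1, col="userID"):
    """Illustrative filter: keep rows whose `col` value occurs at least `min_rating` times."""
    counts = data.groupby(col)[col].transform("count")
    return data[counts >= min_rating]


# User 1 has 3 interactions, user 2 has 1, user 3 has 2.
ratings = pd.DataFrame(
    {"userID": [1, 1, 1, 2, 3, 3], "itemID": [10, 11, 12, 10, 11, 12]}
)
kept = filter_min_interactions(ratings, min_rating=2)
print(sorted(kept["userID"].unique().tolist()))
```

Here user 2 is dropped before any splitting happens, so that user appears in neither the train nor the test set.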

Dependencies

  • numpy - Random number generation
  • pandas - DataFrame manipulation and groupby operations
  • sklearn - Stratified splitting utilities (via internal delegation)
