Implementation:Recommenders team Recommenders Python Stratified Split
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Data Splitting, Evaluation Methodology |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete tool from the recommenders library for stratified train/test splitting of user-item interaction data.
Description
The python_stratified_split function splits a pandas DataFrame of user-item interactions into training and test sets so that each user's (or item's) interactions are divided in the requested proportion. It delegates to an internal stratification routine that groups the data by the specified entity (user or item), filters out entities with fewer interactions than min_rating, and performs a randomized proportional split within each group. The function supports both two-way splits (a single float ratio) and multi-way splits (a list of float ratios).
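The steps described above (filter, group, shuffle, cut proportionally) can be sketched in a few lines of pandas. This is a minimal illustration, not the library's implementation: the function name stratified_split_sketch, the group_col parameter, and the sample data are all hypothetical, and the rounding rule at the cut point is an assumption.

```python
import numpy as np
import pandas as pd


def stratified_split_sketch(data, ratio=0.75, min_rating=1, group_col="userID", seed=42):
    """Illustrative two-way stratified split (hypothetical helper, not the library code)."""
    # Drop groups (users or items) with fewer than `min_rating` interactions.
    counts = data.groupby(group_col)[group_col].transform("count")
    kept = data[counts >= min_rating]

    rng = np.random.default_rng(seed)
    train_parts, test_parts = [], []
    for _, group in kept.groupby(group_col):
        # Shuffle the group's rows, then cut at the proportional boundary.
        shuffled = group.iloc[rng.permutation(len(group))]
        cut = int(round(ratio * len(shuffled)))  # assumed rounding rule
        train_parts.append(shuffled.iloc[:cut])
        test_parts.append(shuffled.iloc[cut:])
    return pd.concat(train_parts), pd.concat(test_parts)


# Hypothetical sample data: users 1 and 2 have four ratings each, user 3 has one.
ratings = pd.DataFrame(
    {
        "userID": [1, 1, 1, 1, 2, 2, 2, 2, 3],
        "itemID": [10, 11, 12, 13, 10, 11, 12, 14, 10],
        "rating": [5, 4, 3, 5, 2, 4, 5, 1, 3],
    }
)
train, test = stratified_split_sketch(ratings, ratio=0.75, min_rating=2)
print(train.groupby("userID").size().to_dict())  # {1: 3, 2: 3}
print(test.groupby("userID").size().to_dict())   # {1: 1, 2: 1}
```

With min_rating=2, user 3 is filtered out before the split, and each remaining user contributes three rows to train and one to test. Which specific rows land in each split depends on the seed.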
Usage
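Import and call this function after loading your dataset and before model training. It creates reproducible (seeded), stratified train/test splits in which each retained user (or item) contributes interactions to both splits, provided that entity has enough interactions to divide; an entity with only one interaction can land in only one split.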
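Whether every user actually appears in both splits is easy to verify after the fact. The following sketch uses a hypothetical helper, entities_missing_from_test, and hand-built DataFrames standing in for real split output; the "userID" column name is an assumption about the data schema.

```python
import pandas as pd


def entities_missing_from_test(train, test, col="userID"):
    """Return IDs present in `train` but absent from `test` (hypothetical helper)."""
    return sorted(set(train[col]) - set(test[col]))


# Hand-built stand-ins for the two DataFrames returned by a split.
train = pd.DataFrame({"userID": [1, 1, 1, 2, 2, 3], "itemID": [10, 11, 12, 10, 13, 11]})
test = pd.DataFrame({"userID": [1, 2], "itemID": [13, 12]})

missing = entities_missing_from_test(train, test)
print(missing)  # [3]
```

A non-empty result usually points to users with too few interactions; raising min_rating is one way to exclude them before splitting.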
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/datasets/python_splitters.py
- Lines: L161-L201
Signature
def python_stratified_split(
    data,
    ratio=0.75,
    min_rating=1,
    filter_by="user",
    col_user=DEFAULT_USER_COL,
    col_item=DEFAULT_ITEM_COL,
    seed=42,
) -> list[pd.DataFrame]
Import
from recommenders.datasets.python_splitters import python_stratified_split
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | pd.DataFrame | Yes | User-item interaction DataFrame to be split. |
| ratio | float or list of float | No (default: 0.75) | Split ratio. A single float produces a two-way split (train/test). A list of floats produces multiple splits. Ratios are normalized to sum to 1 if they do not already. |
| min_rating | int | No (default: 1) | Minimum number of ratings a user or item must have to be included in the split. Entities below this threshold are filtered out. |
| filter_by | str | No (default: "user") | Entity to stratify and filter by. Either "user" or "item". |
| col_user | str | No (default: DEFAULT_USER_COL) | Column name for user IDs. |
| col_item | str | No (default: DEFAULT_ITEM_COL) | Column name for item IDs. |
| seed | int | No (default: 42) | Random seed for reproducible splits. |
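The handling of the ratio argument described in the table can be illustrated with a small pure-Python sketch. The helper name normalize_ratios and its validation rules are assumptions made for illustration; the library's own checks may differ.

```python
def normalize_ratios(ratio):
    """Return a list of split fractions that sums to 1 (hypothetical helper)."""
    if isinstance(ratio, float):
        # A single float r is read as the two-way split [r, 1 - r].
        if not 0 < ratio < 1:
            raise ValueError("A single ratio must lie strictly between 0 and 1.")
        return [ratio, 1 - ratio]
    if any(r <= 0 for r in ratio):
        raise ValueError("Every ratio in the list must be positive.")
    # Rescale the list so that its elements sum to 1.
    total = sum(ratio)
    return [r / total for r in ratio]


print(normalize_ratios(0.75))             # [0.75, 0.25]
print(normalize_ratios([6.0, 2.0, 2.0]))  # [0.6, 0.2, 0.2]
```

Normalizing up front means callers can pass relative weights such as [6.0, 2.0, 2.0] and still get a 60/20/20 split.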
Outputs
| Name | Type | Description |
|---|---|---|
| return | list[pd.DataFrame] | List of DataFrames corresponding to each split. For a single float ratio, returns a list of two DataFrames [train, test]. For a list of ratios, returns one DataFrame per ratio element. |
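To see how one group's rows could map onto one DataFrame per ratio element, the sketch below cuts a group of n rows at cumulative ratio boundaries. The helper chunk_sizes and its rounding rule are assumptions for illustration; the library's exact boundary arithmetic may differ.

```python
from itertools import accumulate


def chunk_sizes(n_rows, ratios):
    """Sizes of consecutive chunks when n_rows are cut at cumulative ratio boundaries."""
    # Cumulative ratios give the upper boundary of each chunk as a row position.
    bounds = [round(c * n_rows) for c in accumulate(ratios)]
    bounds[-1] = n_rows  # guard against floating-point drift in the last boundary
    return [hi - lo for lo, hi in zip([0] + bounds[:-1], bounds)]


print(chunk_sizes(10, [0.6, 0.2, 0.2]))  # [6, 2, 2]
print(chunk_sizes(4, [0.75, 0.25]))      # [3, 1]
```

For small groups the achieved proportions can only approximate the requested ones, since rows cannot be divided.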
Usage Examples
Basic Usage
from recommenders.datasets.python_splitters import python_stratified_split

# `data` is assumed to be a pandas DataFrame of user-item interactions
# whose user and item ID columns match col_user and col_item.

# Two-way 75/25 stratified split by user
train, test = python_stratified_split(data, ratio=0.75, seed=42)

# Three-way split (train/val/test) with 60/20/20 ratio
train, val, test = python_stratified_split(data, ratio=[0.6, 0.2, 0.2])

# Stratify by item instead of user
train, test = python_stratified_split(data, ratio=0.75, filter_by="item")

# Filter out users with fewer than 5 ratings
train, test = python_stratified_split(data, ratio=0.75, min_rating=5)
Dependencies
- numpy - Random number generation
- pandas - DataFrame manipulation and groupby operations
- sklearn - Stratified splitting utilities (via internal delegation)