Implementation: Recommenders Spark Random Split
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for splitting Spark DataFrames into train/test partitions using distributed random assignment with configurable ratios and reproducible seeding.
Description
The spark_random_split function wraps PySpark's native DataFrame.randomSplit() with a convenience interface that accepts either a single float ratio (for two-way splits) or a list of floats (for multi-way splits). It delegates ratio validation and normalization to the process_split_ratio utility, then calls the underlying Spark method, which performs the split in a distributed manner across the cluster's partitions.
When a single float is provided (e.g., 0.75), the function creates a two-way split with [0.75, 0.25] weights. When a list is provided (e.g., [0.6, 0.2, 0.2]), the split produces the corresponding number of DataFrames. The seed parameter ensures reproducibility across runs.
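As a rough illustration of the behavior described above (a sketch, not the library's actual source), the wrapper is approximately equivalent to the following, assuming list ratios are normalized into relative weights before delegating to Spark:
from pyspark.sql import DataFrame

def random_split_sketch(data: DataFrame, ratio=0.75, seed=42):
    # Sketch only: expand a single float into two-way weights,
    # normalize a list of weights, then delegate to Spark's randomSplit,
    # which executes the split across cluster partitions.
    if isinstance(ratio, float):
        weights = [ratio, 1.0 - ratio]        # e.g. 0.75 -> [0.75, 0.25]
    else:
        total = float(sum(ratio))
        weights = [r / total for r in ratio]  # normalize if sum != 1.0
    return data.randomSplit(weights, seed=seed)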
Usage
Call this function after loading data with load_spark_df and before training an ALS model. The first element of the returned list is typically used as the training set and the second as the test set. For multi-way splits, the intermediate element(s) serve as validation sets.
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/datasets/spark_splitters.py (Lines 23-45)
Signature
def spark_random_split(
    data,
    ratio=0.75,
    seed=42,
) -> list[pyspark.sql.DataFrame]
Import
from recommenders.datasets.spark_splitters import spark_random_split
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | pyspark.sql.DataFrame | Yes | Spark DataFrame containing the interaction data to split |
| ratio | float or list of float | No (default: 0.75) | Split ratio; a single float produces a two-way split, a list produces a multi-way split. Ratios are normalized if they do not sum to 1.0 |
| seed | int | No (default: 42) | Random seed for reproducible splitting |
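As the ratio row above notes, a list that does not sum to 1.0 is treated as relative weights. The example below assumes that normalization behavior; the exact normalized values are an inference, not verified output:
from recommenders.datasets.spark_splitters import spark_random_split
# [0.6, 0.2] sums to 0.8, so it is assumed to normalize to [0.75, 0.25]
train, test = spark_random_split(data, ratio=[0.6, 0.2], seed=42)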
Outputs
| Name | Type | Description |
|---|---|---|
| splits | list[pyspark.sql.DataFrame] | List of Spark DataFrames; a two-way split returns [train_df, test_df], a multi-way split returns [df_1, df_2, ..., df_k] |
Usage Examples
Two-Way Train/Test Split
from recommenders.datasets.spark_splitters import spark_random_split
# Split 75% train, 25% test
train, test = spark_random_split(data, ratio=0.75, seed=42)
print(f"Train: {train.count()} rows")
print(f"Test: {test.count()} rows")
# Train: ~75000 rows
# Test: ~25000 rows
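The printed counts are approximate: Spark's randomSplit samples each row independently, so the realized sizes fluctuate around the requested ratio (for a fixed seed the result is stable across runs). A quick check of the realized fraction:
total = train.count() + test.count()
print(f"Realized train fraction: {train.count() / total:.3f}")  # close to, but rarely exactly, 0.750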
Multi-Way Split
from recommenders.datasets.spark_splitters import spark_random_split
# Split into train (60%), validation (20%), test (20%)
train, val, test = spark_random_split(data, ratio=[0.6, 0.2, 0.2], seed=42)
Full ALS Workflow Context
from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.datasets.movielens import load_spark_df
from recommenders.datasets.spark_splitters import spark_random_split
spark = start_or_get_spark(app_name="ALS_Example")
data = load_spark_df(spark, size="100k")
train, test = spark_random_split(data, ratio=0.75, seed=42)
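The split feeds directly into model training. The snippet below is a hedged sketch of that next step using PySpark's built-in ALS estimator; the column names "userID", "itemID", and "rating" are assumptions about the schema returned by load_spark_df and should be checked against data.columns before fitting:
from pyspark.ml.recommendation import ALS

# Column names below are assumptions; verify with data.columns
als = ALS(
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    rank=10,
    maxIter=15,
    regParam=0.05,
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
    seed=42,
)
model = als.fit(train)
predictions = model.transform(test)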