Implementation: Recommenders Spark Random Split
| Knowledge Sources | |
|---|---|
| Domains | Data Engineering, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for splitting Spark DataFrames into train/test partitions using distributed random assignment with configurable ratios and reproducible seeding.
Description
The spark_random_split function wraps PySpark's native DataFrame.randomSplit() with a convenience interface that accepts either a single float ratio (for two-way splits) or a list of floats (for multi-way splits). It delegates ratio validation and normalization to the process_split_ratio utility, then calls the underlying Spark method, which performs the split in a distributed manner across the cluster's partitions.
When a single float is provided (e.g., 0.75), the function creates a two-way split with [0.75, 0.25] weights. When a list is provided (e.g., [0.6, 0.2, 0.2]), the split produces the corresponding number of DataFrames. The seed parameter ensures reproducibility across runs.
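As a rough illustration of the behavior described above (a sketch, not the library's actual source), the wrapper is approximately equivalent to the following, assuming list ratios are normalized into relative weights before delegating to Spark:
from pyspark.sql import DataFrame

def random_split_sketch(data: DataFrame, ratio=0.75, seed=42):
    # Sketch only: expand a single float into two-way weights,
    # normalize a list of weights, then delegate to Spark's randomSplit,
    # which executes the split across cluster partitions.
    if isinstance(ratio, float):
        weights = [ratio, 1.0 - ratio]        # e.g. 0.75 -> [0.75, 0.25]
    else:
        total = float(sum(ratio))
        weights = [r / total for r in ratio]  # normalize if sum != 1.0
    return data.randomSplit(weights, seed=seed)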
Usage
Call this function after loading data with load_spark_df and before training an ALS model. The first element of the returned list is typically used as the training set and the second as the test set. For multi-way splits, the intermediate element(s) serve as validation sets.
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/datasets/spark_splitters.py (Lines 23-45)
Signature
def spark_random_split(
    data,
    ratio=0.75,
    seed=42,
) -> list[pyspark.sql.DataFrame]
Import
from recommenders.datasets.spark_splitters import spark_random_split
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| data | pyspark.sql.DataFrame | Yes | Spark DataFrame containing the interaction data to split |
| ratio | float or list of float | No (default: 0.75) | Split ratio; a single float produces a two-way split, a list produces a multi-way split. Ratios are normalized if they do not sum to 1.0 |
| seed | int | No (default: 42) | Random seed for reproducible splitting |
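As the ratio row above notes, a list that does not sum to 1.0 is treated as relative weights. The example below assumes that normalization behavior; the exact normalized values are an inference, not verified output:
from recommenders.datasets.spark_splitters import spark_random_split
# [0.6, 0.2] sums to 0.8, so it is assumed to normalize to [0.75, 0.25]
train, test = spark_random_split(data, ratio=[0.6, 0.2], seed=42)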
Outputs
| Name | Type | Description |
|---|---|---|
| splits | list[pyspark.sql.DataFrame] | List of Spark DataFrames; a two-way split returns [train_df, test_df], a multi-way split returns [df_1, df_2, ..., df_k] |
Usage Examples
Two-Way Train/Test Split
from recommenders.datasets.spark_splitters import spark_random_split
# Split 75% train, 25% test
train, test = spark_random_split(data, ratio=0.75, seed=42)
print(f"Train: {train.count()} rows")
print(f"Test: {test.count()} rows")
# Train: ~75000 rows
# Test: ~25000 rows
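The printed counts are approximate: Spark's randomSplit samples each row independently, so the realized sizes fluctuate around the requested ratio (for a fixed seed the result is stable across runs). A quick check of the realized fraction:
total = train.count() + test.count()
print(f"Realized train fraction: {train.count() / total:.3f}")  # close to, but rarely exactly, 0.750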
Multi-Way Split
from recommenders.datasets.spark_splitters import spark_random_split
# Split into train (60%), validation (20%), test (20%)
train, val, test = spark_random_split(data, ratio=[0.6, 0.2, 0.2], seed=42)
Full ALS Workflow Context
from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.datasets.movielens import load_spark_df
from recommenders.datasets.spark_splitters import spark_random_split
spark = start_or_get_spark(app_name="ALS_Example")
data = load_spark_df(spark, size="100k")
train, test = spark_random_split(data, ratio=0.75, seed=42)
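The split feeds directly into model training. The snippet below is a hedged sketch of that next step using PySpark's built-in ALS estimator; the column names "userID", "itemID", and "rating" are assumptions about the schema returned by load_spark_df and should be checked against data.columns before fitting:
from pyspark.ml.recommendation import ALS

# Column names below are assumptions; verify with data.columns
als = ALS(
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    rank=10,
    maxIter=15,
    regParam=0.05,
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
    seed=42,
)
model = als.fit(train)
predictions = model.transform(test)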