Implementation: Recommenders Spark Random Split (Recommenders team)

From Leeroopedia


Knowledge Sources

  • Domains: Data Engineering, Distributed Computing
  • Last Updated: 2026-02-10 00:00 GMT

Overview

Concrete tool for splitting Spark DataFrames into train/test partitions using distributed random assignment with configurable ratios and reproducible seeding.

Description

The spark_random_split function wraps PySpark's native DataFrame.randomSplit() with a convenience interface that accepts either a single float ratio (for two-way splits) or a list of floats (for multi-way splits). It delegates ratio validation and normalization to the process_split_ratio utility, then calls the underlying Spark method, which performs the split in parallel across the cluster's partitions.

When a single float is provided (e.g., 0.75), the function creates a two-way split with [0.75, 0.25] weights. When a list is provided (e.g., [0.6, 0.2, 0.2]), the split produces the corresponding number of DataFrames. The seed parameter ensures reproducibility across runs.
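The ratio handling described above can be sketched in plain Python. This is a hypothetical stand-in for the real process_split_ratio utility, whose exact behavior may differ:

```python
def expand_ratio(ratio):
    """Turn a single float into two-way weights, or normalize a list.

    Illustrative sketch only; the actual utility is
    recommenders.datasets.split_utils.process_split_ratio.
    """
    if isinstance(ratio, float):
        if not 0 < ratio < 1:
            raise ValueError("Single-float ratio must be in (0, 1)")
        # A single float r becomes the two-way weights [r, 1 - r]
        return [ratio, 1 - ratio]
    total = sum(ratio)
    # Normalize so the weights sum to 1.0, matching randomSplit's contract
    return [r / total for r in ratio]

print(expand_ratio(0.75))      # [0.75, 0.25]
print(expand_ratio([6, 2, 2])) # [0.6, 0.2, 0.2]
```

Note that a list that already sums to 1.0 passes through unchanged, so callers can supply either raw weights or proper proportions.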

Usage

Call this function after loading data with load_spark_df and before training an ALS model. The first element of the returned list is typically used as the training set and the second as the test set. For multi-way splits, the intermediate element(s) serve as validation sets.

Code Reference

Source Location

  • Repository: recommenders
  • File: recommenders/datasets/spark_splitters.py (Lines 23-45)

Signature

def spark_random_split(
    data,
    ratio=0.75,
    seed=42,
) -> list[pyspark.sql.DataFrame]

Import

from recommenders.datasets.spark_splitters import spark_random_split

I/O Contract

Inputs

  • data (pyspark.sql.DataFrame, required): Spark DataFrame containing the interaction data to split
  • ratio (float or list of float, default: 0.75): Split ratio; a single float produces a two-way split, a list produces a multi-way split. Ratios are normalized if they do not sum to 1.0
  • seed (int, default: 42): Random seed for reproducible splitting

Outputs

  • splits (list[pyspark.sql.DataFrame]): List of Spark DataFrames; a two-way split returns [train_df, test_df], a multi-way split returns [df_1, df_2, ..., df_k]

Usage Examples

Two-Way Train/Test Split

from recommenders.datasets.spark_splitters import spark_random_split

# Split 75% train, 25% test
train, test = spark_random_split(data, ratio=0.75, seed=42)

print(f"Train: {train.count()} rows")
print(f"Test:  {test.count()} rows")
# Counts are approximate (e.g., ~75000 / ~25000 for a 100k-row input)
# because randomSplit assigns each row probabilistically.
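Because randomSplit assigns each row to a partition probabilistically, the split sizes are only approximately proportional to the weights, while a fixed seed makes the assignment reproducible. A plain-Python simulation (not Spark's actual algorithm) illustrates both properties:

```python
import random

def simulate_random_split(n_rows, weights, seed):
    """Toy per-row weighted assignment, illustrative only; Spark's
    implementation differs in detail but shares these two properties."""
    rng = random.Random(seed)
    counts = [0] * len(weights)
    for _ in range(n_rows):
        # Each row independently lands in exactly one partition
        counts[rng.choices(range(len(weights)), weights=weights)[0]] += 1
    return counts

first = simulate_random_split(100_000, [0.75, 0.25], seed=42)
second = simulate_random_split(100_000, [0.75, 0.25], seed=42)
print(first)            # roughly [75000, 25000], not exactly
assert first == second  # same seed -> identical assignment
```

This is why exact row counts vary slightly between datasets even at the same ratio, and why pinning the seed is essential for reproducible experiments.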

Multi-Way Split

from recommenders.datasets.spark_splitters import spark_random_split

# Split into train (60%), validation (20%), test (20%)
train, val, test = spark_random_split(data, ratio=[0.6, 0.2, 0.2], seed=42)

Full ALS Workflow Context

from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.datasets.movielens import load_spark_df
from recommenders.datasets.spark_splitters import spark_random_split

spark = start_or_get_spark(app_name="ALS_Example")
data = load_spark_df(spark, size="100k")
train, test = spark_random_split(data, ratio=0.75, seed=42)

Related Pages

Implements Principle

Requires Environment
