# Recommenders Benchmark: Prepare Training

Implementation: Recommenders team
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Data Preparation |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
Concrete tools for converting pandas DataFrames into the algorithm-specific training data formats used in the benchmarking workflow.
## Description
The `prepare_training_*` family of functions in `benchmark_utils.py` provides a uniform adapter interface for data preparation across all benchmarked algorithms. Each function accepts the same pair of pandas DataFrames (`train`, `test`) and returns the algorithm-native data structure required for model training. This enables the benchmark loop to call data preparation generically via a dispatch dictionary, without algorithm-specific branching logic.
The functions handle all format-specific concerns internally:
- `prepare_training_sar`: Returns the training DataFrame unchanged (identity transform).
- `prepare_training_als`: Creates a typed PySpark DataFrame with an explicit schema (`IntegerType` for user/item, `FloatType` for rating, `LongType` for timestamp).
- `prepare_training_svd`: Builds a Surprise `Trainset` by loading from the DataFrame with a specified rating scale.
- `prepare_training_ncf`: Sorts data by user, filters test to known users/items, writes CSV files, and constructs an `NCFDataset`.
- `prepare_training_cornac`: Converts to Cornac UIR format (used by both BPR and BiVAE).
- `prepare_training_embdotbias`: Casts user/item columns to strings and creates a `RecoDataLoader` with a 10% validation split.
- `prepare_training_lightgcn`: Wraps both train and test DataFrames in an `ImplicitCF` object for graph-based learning.
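To illustrate the kind of format-specific work these adapters do, the sketch below reproduces one step described above for `prepare_training_ncf` (restricting the test set to users and items seen during training) using plain pandas. `filter_to_known` and the toy data are hypothetical, not the actual implementation:

```python
import pandas as pd

def filter_to_known(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Keep only test rows whose user AND item both appear in train.

    Mimics the filtering step inside prepare_training_ncf; this is an
    illustrative sketch, not the library code.
    """
    known_users = set(train["userID"])
    known_items = set(train["itemID"])
    mask = test["userID"].isin(known_users) & test["itemID"].isin(known_items)
    return test[mask]

train = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 11, 10], "rating": [4.0, 3.0, 5.0]})
test = pd.DataFrame({"userID": [1, 2, 3], "itemID": [11, 12, 10], "rating": [2.0, 4.0, 1.0]})

# User 3 and item 12 never appear in train, so only the (1, 11) row survives.
filtered = filter_to_known(train, test)
```

Filtering like this prevents cold-start users/items from reaching a model that cannot score them.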
## Usage
Use these functions when setting up a multi-algorithm benchmark. Each function is registered in a dispatch dictionary keyed by algorithm name, allowing the benchmark loop to call the appropriate preparation function generically.
## Code Reference

### Source Location

- Repository: recommenders
- File: `examples/06_benchmarks/benchmark_utils.py` (lines 76-384)
### Signatures

```python
def prepare_training_sar(train, test) -> pd.DataFrame
def prepare_training_als(train, test) -> pyspark.sql.DataFrame
def prepare_training_svd(train, test) -> surprise.Trainset
def prepare_training_ncf(df_train, df_test) -> NCFDataset
def prepare_training_cornac(train, test) -> cornac.data.Dataset
def prepare_training_embdotbias(train, test) -> RecoDataLoader
def prepare_training_lightgcn(train, test) -> ImplicitCF
```
### Import

```python
import sys

sys.path.append("examples/06_benchmarks")
from benchmark_utils import (
    prepare_training_sar,
    prepare_training_als,
    prepare_training_svd,
    prepare_training_ncf,
    prepare_training_cornac,
    prepare_training_embdotbias,
    prepare_training_lightgcn,
)
```
## I/O Contract

| Function | Input: `train` | Input: `test` | Output Type | Notes |
|---|---|---|---|---|
| `prepare_training_sar` | pd.DataFrame | pd.DataFrame (unused) | pd.DataFrame | Identity; returns train as-is |
| `prepare_training_als` | pd.DataFrame | pd.DataFrame (unused) | pyspark.sql.DataFrame | Creates typed Spark schema (Int user/item, Float rating, Long timestamp) |
| `prepare_training_svd` | pd.DataFrame | pd.DataFrame (unused) | surprise.Trainset | Drops timestamp column; uses rating_scale=(1, 5) |
| `prepare_training_ncf` | pd.DataFrame | pd.DataFrame | NCFDataset | Sorts by user; filters test to known users/items; writes temp CSV files |
| `prepare_training_cornac` | pd.DataFrame | pd.DataFrame (unused) | cornac.data.Dataset | Drops timestamp; converts to UIR triplets; used by BPR and BiVAE |
| `prepare_training_embdotbias` | pd.DataFrame | pd.DataFrame (unused) | RecoDataLoader | Casts user/item to str; creates 90/10 train/valid split |
| `prepare_training_lightgcn` | pd.DataFrame | pd.DataFrame | ImplicitCF | Wraps both train and test into an ImplicitCF graph structure |
All functions expect the input DataFrames to contain the standard columns `userID`, `itemID`, `rating`, and `timestamp` (as defined by `DEFAULT_USER_COL`, `DEFAULT_ITEM_COL`, `DEFAULT_RATING_COL`, `DEFAULT_TIMESTAMP_COL`).
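A minimal sketch of a DataFrame that satisfies this contract; the column names follow the defaults described above, while the values are made-up toy data:

```python
import pandas as pd

# Standard columns every prepare_training_* adapter expects,
# per the defaults listed above.
COLS = ["userID", "itemID", "rating", "timestamp"]

df_train = pd.DataFrame(
    {
        "userID": [1, 2],                 # integer user ids
        "itemID": [10, 20],               # integer item ids
        "rating": [4.0, 5.0],             # float ratings
        "timestamp": [1700000000, 1700000100],  # epoch seconds
    }
)

# Sanity-check the contract before handing the frame to an adapter.
assert list(df_train.columns) == COLS
```

Validating the column set up front gives a clearer failure than an adapter raising deep inside Spark, Surprise, or Cornac.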
## Usage Examples
```python
from benchmark_utils import (
    prepare_training_sar,
    prepare_training_als,
    prepare_training_svd,
    prepare_training_ncf,
    prepare_training_cornac,
    prepare_training_embdotbias,
    prepare_training_lightgcn,
)

# Build a dispatch dictionary for the benchmark loop
prepare_training_data = {
    "als": prepare_training_als,
    "sar": prepare_training_sar,
    "svd": prepare_training_svd,
    "embdotbias": prepare_training_embdotbias,
    "ncf": prepare_training_ncf,
    "bpr": prepare_training_cornac,
    "bivae": prepare_training_cornac,
    "lightgcn": prepare_training_lightgcn,
}

# In the benchmark loop, call generically:
for algo in algorithms:
    train_data = prepare_training_data[algo](df_train, df_test)
    model, time_train = trainer[algo](params[algo], train_data)
```
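The loop above relies on `trainer` and `params` dictionaries defined elsewhere in the benchmark. The self-contained sketch below fills those in with stubs (`prepare_identity` and `train_stub` are hypothetical stand-ins, not the real benchmark code) to show the dispatch pattern end to end:

```python
import time
import pandas as pd

def prepare_identity(train, test):
    # Stand-in for the SAR case: identity transform on the train frame.
    return train

def train_stub(params, data):
    # Stand-in trainer: returns a fake "model" plus elapsed training time,
    # matching the (model, time_train) shape used in the benchmark loop.
    start = time.time()
    model = {"n_rows": len(data), **params}
    return model, time.time() - start

# Mirror the three dispatch dictionaries the loop expects.
prepare_training_data = {"sar": prepare_identity}
trainer = {"sar": train_stub}
params = {"sar": {"similarity_type": "jaccard"}}

df_train = pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [4.0, 5.0]})
df_test = pd.DataFrame({"userID": [1], "itemID": [20], "rating": [3.0]})

for algo in ["sar"]:
    train_data = prepare_training_data[algo](df_train, df_test)
    model, time_train = trainer[algo](params[algo], train_data)
```

Because every adapter shares the `(train, test)` signature, adding a new algorithm only requires registering one more entry in each dictionary.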