# Recommenders Benchmark: Prepare Training

Implementation: Recommenders team
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Benchmarking, Data Preparation |
| Last Updated | 2026-02-10 00:00 GMT |
## Overview
Concrete tools for converting pandas DataFrames into the algorithm-specific training data formats used in the benchmarking workflow.
## Description
The `prepare_training_*` family of functions in `benchmark_utils.py` provides a uniform adapter interface for data preparation across all benchmarked algorithms. Each function accepts the same pair of pandas DataFrames (`train`, `test`) and returns the algorithm-native data structure required for model training. This enables the benchmark loop to call data preparation generically via a dispatch dictionary, without algorithm-specific branching logic.
The functions handle all format-specific concerns internally:
- `prepare_training_sar`: Returns the training DataFrame unchanged (identity transform).
- `prepare_training_als`: Creates a typed PySpark DataFrame with an explicit schema (`IntegerType` for user/item, `FloatType` for rating, `LongType` for timestamp).
- `prepare_training_svd`: Builds a Surprise `Trainset` by loading from the DataFrame with a specified rating scale.
- `prepare_training_ncf`: Sorts data by user, filters test to known users/items, writes CSV files, and constructs an `NCFDataset`.
- `prepare_training_cornac`: Converts to Cornac UIR format (used by both BPR and BiVAE).
- `prepare_training_embdotbias`: Casts user/item columns to strings and creates a `RecoDataLoader` with a 10% validation split.
- `prepare_training_lightgcn`: Wraps both train and test DataFrames in an `ImplicitCF` object for graph-based learning.
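To illustrate the kind of format-specific work these adapters do, the sketch below reproduces one step described above for `prepare_training_ncf` (restricting the test set to users and items seen during training) using plain pandas. `filter_to_known` and the toy data are hypothetical, not the actual implementation:

```python
import pandas as pd

def filter_to_known(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Keep only test rows whose user AND item both appear in train.

    Mimics the filtering step inside prepare_training_ncf; this is an
    illustrative sketch, not the library code.
    """
    known_users = set(train["userID"])
    known_items = set(train["itemID"])
    mask = test["userID"].isin(known_users) & test["itemID"].isin(known_items)
    return test[mask]

train = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 11, 10], "rating": [4.0, 3.0, 5.0]})
test = pd.DataFrame({"userID": [1, 2, 3], "itemID": [11, 12, 10], "rating": [2.0, 4.0, 1.0]})

# User 3 and item 12 never appear in train, so only the (1, 11) row survives.
filtered = filter_to_known(train, test)
```

Filtering like this prevents cold-start users/items from reaching a model that cannot score them.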
## Usage
Use these functions when setting up a multi-algorithm benchmark. Each function is registered in a dispatch dictionary keyed by algorithm name, allowing the benchmark loop to call the appropriate preparation function generically.
## Code Reference

### Source Location

- Repository: recommenders
- File: `examples/06_benchmarks/benchmark_utils.py` (lines 76-384)
### Signatures

```python
def prepare_training_sar(train, test) -> pd.DataFrame
def prepare_training_als(train, test) -> pyspark.sql.DataFrame
def prepare_training_svd(train, test) -> surprise.Trainset
def prepare_training_ncf(df_train, df_test) -> NCFDataset
def prepare_training_cornac(train, test) -> cornac.data.Dataset
def prepare_training_embdotbias(train, test) -> RecoDataLoader
def prepare_training_lightgcn(train, test) -> ImplicitCF
```
### Import

```python
import sys

sys.path.append("examples/06_benchmarks")
from benchmark_utils import (
    prepare_training_sar,
    prepare_training_als,
    prepare_training_svd,
    prepare_training_ncf,
    prepare_training_cornac,
    prepare_training_embdotbias,
    prepare_training_lightgcn,
)
```
## I/O Contract

| Function | Input: `train` | Input: `test` | Output Type | Notes |
|---|---|---|---|---|
| `prepare_training_sar` | pd.DataFrame | pd.DataFrame (unused) | pd.DataFrame | Identity; returns train as-is |
| `prepare_training_als` | pd.DataFrame | pd.DataFrame (unused) | pyspark.sql.DataFrame | Creates typed Spark schema (Int user/item, Float rating, Long timestamp) |
| `prepare_training_svd` | pd.DataFrame | pd.DataFrame (unused) | surprise.Trainset | Drops timestamp column; uses rating_scale=(1, 5) |
| `prepare_training_ncf` | pd.DataFrame | pd.DataFrame | NCFDataset | Sorts by user; filters test to known users/items; writes temp CSV files |
| `prepare_training_cornac` | pd.DataFrame | pd.DataFrame (unused) | cornac.data.Dataset | Drops timestamp; converts to UIR triplets; used by BPR and BiVAE |
| `prepare_training_embdotbias` | pd.DataFrame | pd.DataFrame (unused) | RecoDataLoader | Casts user/item to str; creates 90/10 train/valid split |
| `prepare_training_lightgcn` | pd.DataFrame | pd.DataFrame | ImplicitCF | Wraps both train and test into an ImplicitCF graph structure |
All functions expect the input DataFrames to contain the standard columns `userID`, `itemID`, `rating`, and `timestamp` (as defined by `DEFAULT_USER_COL`, `DEFAULT_ITEM_COL`, `DEFAULT_RATING_COL`, `DEFAULT_TIMESTAMP_COL`).
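A minimal sketch of a DataFrame that satisfies this contract; the column names follow the defaults described above, while the values are made-up toy data:

```python
import pandas as pd

# Standard columns every prepare_training_* adapter expects,
# per the defaults listed above.
COLS = ["userID", "itemID", "rating", "timestamp"]

df_train = pd.DataFrame(
    {
        "userID": [1, 2],                 # integer user ids
        "itemID": [10, 20],               # integer item ids
        "rating": [4.0, 5.0],             # float ratings
        "timestamp": [1700000000, 1700000100],  # epoch seconds
    }
)

# Sanity-check the contract before handing the frame to an adapter.
assert list(df_train.columns) == COLS
```

Validating the column set up front gives a clearer failure than an adapter raising deep inside Spark, Surprise, or Cornac.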
## Usage Examples
```python
from benchmark_utils import (
    prepare_training_sar,
    prepare_training_als,
    prepare_training_svd,
    prepare_training_ncf,
    prepare_training_cornac,
    prepare_training_embdotbias,
    prepare_training_lightgcn,
)

# Build a dispatch dictionary for the benchmark loop
prepare_training_data = {
    "als": prepare_training_als,
    "sar": prepare_training_sar,
    "svd": prepare_training_svd,
    "embdotbias": prepare_training_embdotbias,
    "ncf": prepare_training_ncf,
    "bpr": prepare_training_cornac,
    "bivae": prepare_training_cornac,
    "lightgcn": prepare_training_lightgcn,
}

# In the benchmark loop, call generically:
for algo in algorithms:
    train_data = prepare_training_data[algo](df_train, df_test)
    model, time_train = trainer[algo](params[algo], train_data)
```
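The loop above relies on `trainer` and `params` dictionaries defined elsewhere in the benchmark. The self-contained sketch below fills those in with stubs (`prepare_identity` and `train_stub` are hypothetical stand-ins, not the real benchmark code) to show the dispatch pattern end to end:

```python
import time
import pandas as pd

def prepare_identity(train, test):
    # Stand-in for the SAR case: identity transform on the train frame.
    return train

def train_stub(params, data):
    # Stand-in trainer: returns a fake "model" plus elapsed training time,
    # matching the (model, time_train) shape used in the benchmark loop.
    start = time.time()
    model = {"n_rows": len(data), **params}
    return model, time.time() - start

# Mirror the three dispatch dictionaries the loop expects.
prepare_training_data = {"sar": prepare_identity}
trainer = {"sar": train_stub}
params = {"sar": {"similarity_type": "jaccard"}}

df_train = pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [4.0, 5.0]})
df_test = pd.DataFrame({"userID": [1], "itemID": [20], "rating": [3.0]})

for algo in ["sar"]:
    train_data = prepare_training_data[algo](df_train, df_test)
    model, time_train = trainer[algo](params[algo], train_data)
```

Because every adapter shares the `(train, test)` signature, adding a new algorithm only requires registering one more entry in each dictionary.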