
Implementation:Recommenders team Recommenders Benchmark Prepare Training

From Leeroopedia


Knowledge Sources
Domains Recommender Systems, Benchmarking, Data Preparation
Last Updated 2026-02-10 00:00 GMT

Overview

A concrete utility for converting a common pair of pandas DataFrames into the algorithm-specific training data formats used in the benchmarking workflow.

Description

The prepare_training_* family of functions in benchmark_utils.py provides a uniform adapter interface for data preparation across all benchmarked algorithms. Each function accepts the same pair of pandas DataFrames (train, test) and returns the algorithm-native data structure required for model training. This enables the benchmark loop to call data preparation generically via a dispatch dictionary without algorithm-specific branching logic.

The functions handle all format-specific concerns internally:

  • prepare_training_sar: Returns the training DataFrame unchanged (identity transform).
  • prepare_training_als: Creates a typed PySpark DataFrame with an explicit schema (IntegerType for user/item, FloatType for rating, LongType for timestamp).
  • prepare_training_svd: Builds a Surprise Trainset by loading from the DataFrame with a specified rating scale.
  • prepare_training_ncf: Sorts data by user, filters test to known users/items, writes CSV files, and constructs an NCFDataset.
  • prepare_training_cornac: Converts to Cornac UIR format (used by both BPR and BiVAE).
  • prepare_training_embdotbias: Casts user/item columns to strings and creates a RecoDataLoader with a 10% validation split.
  • prepare_training_lightgcn: Wraps both train and test DataFrames in an ImplicitCF object for graph-based learning.
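To make the adapter pattern concrete, here is a minimal sketch of two of the simpler adapters using pandas only. These are illustrative stand-ins, not the actual implementations in benchmark_utils.py; the function and column names mirror the conventions above but are written here for demonstration.

```python
import pandas as pd

# Column names mirroring the recommenders defaults (DEFAULT_USER_COL, etc.)
USER_COL, ITEM_COL = "userID", "itemID"

def prepare_training_identity(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    # SAR-style adapter: the model consumes the raw DataFrame directly,
    # so the train split is returned unchanged and test is ignored.
    return train

def prepare_training_str_ids(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    # embdotbias-style adapter: cast user/item IDs to strings so they are
    # treated as categorical labels rather than numeric values.
    out = train.copy()
    out[USER_COL] = out[USER_COL].astype(str)
    out[ITEM_COL] = out[ITEM_COL].astype(str)
    return out
```

Because every adapter shares the same (train, test) signature, either function can be dropped into the dispatch dictionary shown in the usage section without changing the benchmark loop.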

Usage

Use these functions when setting up a multi-algorithm benchmark. Each function is registered in a dispatch dictionary keyed by algorithm name, allowing the benchmark loop to call the appropriate preparation function generically.

Code Reference

Source Location

  • Repository: recommenders
  • File: examples/06_benchmarks/benchmark_utils.py (Lines 76-384)

Signature

def prepare_training_sar(train, test) -> pd.DataFrame

def prepare_training_als(train, test) -> pyspark.sql.DataFrame

def prepare_training_svd(train, test) -> surprise.Trainset

def prepare_training_ncf(df_train, df_test) -> NCFDataset

def prepare_training_cornac(train, test) -> cornac.data.Dataset

def prepare_training_embdotbias(train, test) -> RecoDataLoader

def prepare_training_lightgcn(train, test) -> ImplicitCF

Import

import sys
sys.path.append("examples/06_benchmarks")
from benchmark_utils import (
    prepare_training_sar,
    prepare_training_als,
    prepare_training_svd,
    prepare_training_ncf,
    prepare_training_cornac,
    prepare_training_embdotbias,
    prepare_training_lightgcn,
)

I/O Contract

| Function | Input: train | Input: test | Output Type | Notes |
|---|---|---|---|---|
| prepare_training_sar | pd.DataFrame | pd.DataFrame (unused) | pd.DataFrame | Identity; returns train as-is |
| prepare_training_als | pd.DataFrame | pd.DataFrame (unused) | pyspark.sql.DataFrame | Creates typed Spark schema (Int user/item, Float rating, Long timestamp) |
| prepare_training_svd | pd.DataFrame | pd.DataFrame (unused) | surprise.Trainset | Drops timestamp column; uses rating_scale=(1, 5) |
| prepare_training_ncf | pd.DataFrame | pd.DataFrame | NCFDataset | Sorts by user; filters test to known users/items; writes temp CSV files |
| prepare_training_cornac | pd.DataFrame | pd.DataFrame (unused) | cornac.data.Dataset | Drops timestamp; converts to UIR triplets; used by BPR and BiVAE |
| prepare_training_embdotbias | pd.DataFrame | pd.DataFrame (unused) | RecoDataLoader | Casts user/item to str; creates 90/10 train/valid split |
| prepare_training_lightgcn | pd.DataFrame | pd.DataFrame | ImplicitCF | Wraps both train and test into an ImplicitCF graph structure |

All functions expect the input DataFrames to contain the standard columns: userID, itemID, rating, and timestamp (as defined by DEFAULT_USER_COL, DEFAULT_ITEM_COL, DEFAULT_RATING_COL, DEFAULT_TIMESTAMP_COL).
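As an illustration of that contract, a small pre-flight check (a hypothetical helper, not part of benchmark_utils.py) can confirm a DataFrame carries the standard columns before it is handed to any adapter:

```python
import pandas as pd

# Hypothetical helper: verify the standard benchmark columns are present
# before handing the DataFrame to any prepare_training_* adapter.
REQUIRED_COLS = ["userID", "itemID", "rating", "timestamp"]

def check_benchmark_columns(df: pd.DataFrame) -> None:
    missing = [c for c in REQUIRED_COLS if c not in df.columns]
    if missing:
        raise ValueError(f"DataFrame is missing required columns: {missing}")

# Toy interactions using the default column names
df = pd.DataFrame({
    "userID": [1, 1, 2],
    "itemID": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
    "timestamp": [1609459200, 1609545600, 1609632000],
})
check_benchmark_columns(df)  # passes silently when all columns are present
```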

Usage Examples

from benchmark_utils import (
    prepare_training_sar,
    prepare_training_als,
    prepare_training_svd,
    prepare_training_ncf,
    prepare_training_cornac,
    prepare_training_embdotbias,
    prepare_training_lightgcn,
)

# Build a dispatch dictionary for the benchmark loop
prepare_training_data = {
    "als": prepare_training_als,
    "sar": prepare_training_sar,
    "svd": prepare_training_svd,
    "embdotbias": prepare_training_embdotbias,
    "ncf": prepare_training_ncf,
    "bpr": prepare_training_cornac,
    "bivae": prepare_training_cornac,
    "lightgcn": prepare_training_lightgcn,
}

# In the benchmark loop, call generically:
for algo in algorithms:
    train_data = prepare_training_data[algo](df_train, df_test)
    model, time_train = trainer[algo](params[algo], train_data)
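To see the dispatch pattern end to end without installing every backend, the sketch below wires a stub adapter and trainer (all names hypothetical) into the same shape the benchmark loop expects:

```python
import time

# Stub adapter/trainer pair mirroring the dispatch pattern: every adapter
# takes (train, test); every trainer takes (params, data) and returns
# (model, training_time), so the loop itself needs no branching.
def prepare_stub(train, test):
    return train  # identity, like prepare_training_sar

def train_stub(params, data):
    start = time.time()
    model = {"params": params, "n_interactions": len(data)}  # placeholder model
    return model, time.time() - start

prepare_training_data = {"stub": prepare_stub}
trainer = {"stub": train_stub}
params = {"stub": {"learning_rate": 0.01}}

df_train = [("u1", "i1", 4.0)]  # stands in for the train DataFrame
df_test = [("u2", "i2", 5.0)]   # stands in for the test DataFrame

results = {}
for algo in prepare_training_data:
    data = prepare_training_data[algo](df_train, df_test)
    model, time_train = trainer[algo](params[algo], data)
    results[algo] = {"model": model, "time_train": time_train}
```

Swapping the stubs for the real prepare_training_* and trainer functions reproduces the benchmark loop shown above unchanged.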

Related Pages

Implements Principle

Requires Environment
