Workflow:Recommenders team Recommenders Algorithm Benchmarking
| Knowledge Sources | |
|---|---|
| Domains | Recommendation_Systems, Model_Evaluation, Benchmarking |
| Last Updated | 2026-02-09 23:00 GMT |
Overview
End-to-end process for systematically benchmarking and comparing multiple recommendation algorithms on the MovieLens dataset using standardized evaluation metrics.
Description
This workflow implements a systematic algorithm comparison framework that evaluates multiple recommendation models under identical conditions. It uses a standardized data preparation pipeline, consistent train-test splitting, and uniform evaluation metrics to produce a fair comparison table. The benchmark covers collaborative filtering algorithms across different computational backends (Python/pandas, PySpark, TensorFlow, PyTorch), including SAR, ALS, NCF, SVD, BPR, BiVAE, LightGCN, and EmbeddingDotBias. Each algorithm is trained, timed, and evaluated on both ranking metrics (MAP, NDCG, Precision, Recall) and rating metrics (RMSE, MAE, R-squared, Explained Variance) to identify the best-performing approach for a given scenario.
Usage
Run this workflow when you need to choose among several candidate recommendation algorithms for your use case. It is appropriate when you are starting a new recommendation project and need empirical evidence to choose between algorithms, or when you are evaluating whether a new algorithm outperforms existing baselines. The benchmark provides a standardized comparison methodology that controls for data, splitting strategy, and evaluation procedure.
Execution Steps
Step 1: Data Loading and Standardization
Load the MovieLens dataset and prepare it in the standard schema format (userID, itemID, rating, timestamp). Select the dataset size variant appropriate for the benchmark scope. Establish the common data representation that all algorithms will consume.
Key considerations:
- Use the same dataset size across all algorithms for fair comparison
- MovieLens 100k is fast for quick comparisons; 1M and 10M provide more robust results
- Ensure consistent column naming across all algorithm pipelines
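As a minimal sketch of the standardization step, the snippet below parses raw rating lines into records that use the standard column names. It assumes a four-field, tab-separated layout (user, item, rating, timestamp) like the MovieLens 100k ratings file. Other variants may use a different separator, which is why `sep` is a parameter. `parse_ratings` and `SCHEMA` are hypothetical names used for illustration and are not part of the benchmark code.

```python
from typing import Dict, Iterable, List

# Column names of the standard schema described above.
SCHEMA = ("userID", "itemID", "rating", "timestamp")


def parse_ratings(lines: Iterable[str], sep: str = "\t") -> List[Dict[str, object]]:
    """Parse raw rating lines into records keyed by the standard schema.

    Assumption: each non-blank line holds exactly four fields in the order
    user, item, rating, timestamp.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, item, rating, ts = line.split(sep)
        records.append(
            {
                "userID": int(user),
                "itemID": int(item),
                "rating": float(rating),
                "timestamp": int(ts),
            }
        )
    return records


# Two illustrative lines plus a blank line that should be ignored.
sample = ["196\t242\t3\t881250949", "186\t302\t3\t891717742", ""]
ratings = parse_ratings(sample)
```

Every downstream step can then rely on the same four keys, whichever algorithm consumes the data.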
Step 2: Stratified Data Splitting
Split the dataset using a stratified strategy with a fixed random seed to ensure all algorithms train and evaluate on identical data. The stratified split guarantees every user appears in both training and test sets, enabling per-user metric computation and preventing cold-start evaluation bias.
Key considerations:
- Use the same random seed across all algorithm evaluations
- 75/25 train-test ratio is the standard for this benchmark
- Stratified splitting is preferred over random for user-level fairness
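The splitting logic can be sketched in plain Python as follows. This is an illustration of the idea (a per-user split at a fixed ratio, driven by one seeded random generator), not the splitter the benchmark itself uses. The function name `stratified_split` and the seed value 42 are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, object]


def stratified_split(
    records: List[Record], ratio: float = 0.75, seed: int = 42
) -> Tuple[List[Record], List[Record]]:
    """Split each user's ratings by `ratio` using a single seeded RNG."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["userID"]].append(rec)

    train, test = [], []
    for user in sorted(by_user):  # fixed iteration order keeps the split deterministic
        rows = by_user[user][:]  # copy so the input is not mutated
        rng.shuffle(rows)
        cut = round(len(rows) * ratio)
        if len(rows) > 1:
            # keep at least one rating on each side for users with 2+ ratings
            cut = min(max(cut, 1), len(rows) - 1)
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return train, test


# Toy data: 3 users with 8 ratings each.
data = [{"userID": u, "itemID": i, "rating": 3.0} for u in (1, 2, 3) for i in range(8)]
train, test = stratified_split(data)
```

Because the seed and the user iteration order are fixed, calling the function again on the same data returns the same split, which is what lets every algorithm see identical train and test sets. A user with a single rating cannot appear on both sides; in this sketch such a user ends up in the training set only.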
Step 3: Algorithm-Specific Data Preparation
Transform the standardized training data into the format required by each specific algorithm. Different algorithms require different data representations: pandas DataFrames for SAR and Surprise SVD, Spark DataFrames for ALS, indexed datasets for NCF, Cornac Dataset objects for BPR and BiVAE, and file-based formats for deep learning models. Each preparation function handles the conversion while preserving the same underlying data.
What happens per algorithm type:
- SAR and Surprise: Direct pandas DataFrame usage
- ALS: Conversion to Spark DataFrame with explicit schema
- NCF: CSV serialization and index building via NCFDataset
- Cornac: Conversion to Cornac Dataset format with rating matrix
- LightGCN: Preparation of graph-based adjacency data
- EmbDotBias: PyTorch DataLoader construction
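Several of the representations listed above (indexed datasets, rating matrices, graph adjacency data) depend on mapping raw user and item IDs to contiguous integer indices. The sketch below shows only that shared building block, under the assumption that records use the standard column names. It does not reproduce `NCFDataset`, the Cornac Dataset format, or any other library object, and `build_index` and `to_indexed_triples` are hypothetical helper names.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def build_index(ids: Iterable[Hashable]) -> Dict[Hashable, int]:
    """Map raw IDs to contiguous integers 0..n-1, assigned in sorted order."""
    return {raw: idx for idx, raw in enumerate(sorted(set(ids)))}


def to_indexed_triples(
    records: List[dict],
    user_index: Dict[Hashable, int],
    item_index: Dict[Hashable, int],
) -> List[Tuple[int, int, float]]:
    """Convert records to (user_idx, item_idx, rating) triples."""
    return [
        (user_index[r["userID"]], item_index[r["itemID"]], float(r["rating"]))
        for r in records
    ]


# Illustrative training records in the standard schema.
train = [
    {"userID": 10, "itemID": 500, "rating": 4.0},
    {"userID": 10, "itemID": 700, "rating": 2.0},
    {"userID": 42, "itemID": 500, "rating": 5.0},
]
user_index = build_index(r["userID"] for r in train)
item_index = build_index(r["itemID"] for r in train)
triples = to_indexed_triples(train, user_index, item_index)
```

Building the indices from the training data only, and reusing them for every algorithm, keeps the underlying data identical across preparations.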
Step 4: Model Training with Timing
Train each algorithm with hyperparameters reported in the benchmark literature, and keep those choices fixed across runs so the results are reproducible. Wrap each training call in a Timer to capture wall-clock training duration. Record the trained model and elapsed time for each algorithm.
Key considerations:
- Use literature-reported hyperparameters for fair comparison
- Timer measures elapsed wall-clock time, so other load on the machine affects the reading
- Some algorithms (SAR) are much faster to train than others (LightGCN)
- GPU-accelerated models should use the same GPU device for comparison
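The timing pattern can be sketched with a small context manager built on `time.perf_counter`. This is a stand-in written for illustration, not the Timer utility referenced above. `WallClockTimer` and `train_dummy_model` are hypothetical names, and the dummy function merely takes the place of a real model's training call.

```python
import time


class WallClockTimer:
    """Context manager that stores elapsed wall-clock seconds in `interval`."""

    def __enter__(self):
        self.interval = None
        self._start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.interval = time.perf_counter() - self._start
        return False  # never swallow exceptions raised during training


def train_dummy_model(n: int) -> int:
    """Hypothetical stand-in for a model's training call."""
    return sum(i * i for i in range(n))


timings = {}
with WallClockTimer() as t:
    model = train_dummy_model(10_000)
timings["dummy"] = t.interval  # elapsed seconds for this algorithm
```

Using `time.process_time()` instead would measure CPU time of the current process, which excludes time spent waiting on I/O or an accelerator.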
Step 5: Prediction and Recommendation Generation
Generate predictions from each trained model on the test set. For ranking evaluation, generate top-k recommendation lists per user. For rating evaluation, generate predicted ratings for test user-item pairs. Time the inference phase separately from training to understand operational cost. Remove already-seen items from recommendation lists.
Key considerations:
- Use k=10 as the standard top-k cutoff
- Exclude items each user already interacted with in the training set, so the evaluation measures discovery of new items
- Inference time matters for production deployment decisions
- Some algorithms support batch prediction; others require per-user scoring
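A minimal sketch of top-k generation with seen-item removal follows. It assumes the model's output is already available as a nested dictionary of per-user item scores. Real models expose their scores in different ways, and `recommend_top_k` is a hypothetical helper.

```python
import heapq
from typing import Dict, List, Set, Tuple


def recommend_top_k(
    scores: Dict[int, Dict[int, float]],
    seen: Dict[int, Set[int]],
    k: int = 10,
) -> Dict[int, List[Tuple[int, float]]]:
    """Return the k highest-scored unseen items per user, best first."""
    recs = {}
    for user, item_scores in scores.items():
        already_seen = seen.get(user, set())
        # drop items the user interacted with in the training set
        candidates = [
            (item, score)
            for item, score in item_scores.items()
            if item not in already_seen
        ]
        recs[user] = heapq.nlargest(k, candidates, key=lambda pair: pair[1])
    return recs


# Illustrative scores for one user; item 101 was seen during training.
scores = {1: {101: 0.9, 102: 0.8, 103: 0.4, 104: 0.7}}
seen = {1: {101}}
top2 = recommend_top_k(scores, seen, k=2)
```

Item 101 has the highest score but is excluded because it appears in the user's training interactions, so the list starts with item 102.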
Step 6: Comprehensive Metric Evaluation
Compute a standardized set of evaluation metrics for each algorithm. Ranking metrics include MAP (Mean Average Precision), NDCG@k (Normalized Discounted Cumulative Gain), Precision@k, and Recall@k. Rating metrics include RMSE (Root Mean Square Error), MAE (Mean Absolute Error), R-squared, and Explained Variance. Compile all results into a comparison table.
Key considerations:
- Not all algorithms produce rating predictions (some only produce rankings)
- MAP and NDCG are the most commonly compared ranking metrics
- RMSE and MAE are the most commonly compared rating metrics
- Statistical significance testing can augment the comparison
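The metrics above can be sketched for a single user as follows, using binary relevance and common textbook definitions. The benchmark's own metric functions may differ in details such as tie handling or normalization, so treat this as an illustration of the formulas rather than a reference implementation. Benchmark-level ranking scores would be the mean of these per-user values.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k recommendations that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)  # pos is 0-based, so rank = pos + 1
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error over paired ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error over paired ratings."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Illustrative inputs: a ranked list for one user and that user's held-out items.
ranked = [5, 3, 9, 1]
relevant = {3, 1, 7}
p_at_4 = precision_at_k(ranked, relevant, 4)
r_at_4 = recall_at_k(ranked, relevant, 4)
n_at_4 = ndcg_at_k(ranked, relevant, 4)
err_rmse = rmse([4.0, 3.0], [3.0, 5.0])
err_mae = mae([4.0, 3.0], [3.0, 5.0])
```

In this example two of the four recommended items are relevant, so precision is 2/4, while recall is 2/3 because one relevant item (7) was never recommended.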
Step 7: Results Compilation and Analysis
Compile all metrics, training times, and prediction times into a comprehensive comparison table. Analyze which algorithms excel at ranking vs rating tasks, and which offer the best tradeoff between accuracy and computational cost. Document the hardware specifications and software versions used to ensure reproducibility.
Key considerations:
- No single algorithm dominates all metrics; the best choice depends on the use case
- Consider both accuracy and computational cost for production decisions
- Document exact versions of all dependencies for reproducibility
- Results should be compared against the benchmark table in the repository README
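One way to compile the comparison table is sketched below. The rows are placeholders with made-up values and hypothetical algorithm names (`algo_a`, `algo_b`); they are not benchmark results. A missing metric is rendered as `n/a`, which covers algorithms that produce rankings but no rating predictions.

```python
from typing import Dict, List, Optional


def format_value(value: Optional[object]) -> str:
    """Render a cell: 'n/a' for missing metrics, 4 decimals for floats."""
    if value is None:
        return "n/a"
    if isinstance(value, float):
        return f"{value:.4f}"
    return str(value)


def compile_results(results: List[Dict[str, object]], columns: List[str]) -> str:
    """Build a pipe-delimited comparison table from per-algorithm result rows."""
    header = "| " + " | ".join(columns) + " |"
    divider = "|" + "|".join("---" for _ in columns) + "|"
    body = [
        "| " + " | ".join(format_value(row.get(col)) for col in columns) + " |"
        for row in results
    ]
    return "\n".join([header, divider] + body)


# Placeholder rows: values are invented for illustration only.
columns = ["Algo", "Train time (s)", "NDCG@k", "RMSE"]
rows = [
    {"Algo": "algo_a", "Train time (s)": 1.5, "NDCG@k": 0.25, "RMSE": None},
    {"Algo": "algo_b", "Train time (s)": 12.0, "NDCG@k": 0.3, "RMSE": 0.95},
]
table = compile_results(rows, columns)
```

Keeping training time next to the accuracy metrics in the same row makes the accuracy versus cost tradeoff visible at a glance.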