Implementation:Recommenders team Recommenders SARPlus

Knowledge Sources	Recommenders
Domains	Collaborative Filtering, Recommendation Systems, Spark
Last Updated	2026-02-10 00:00 GMT

Overview

SARPlus is a PySpark implementation of the Simple Algorithm for Recommendation (SAR) that computes item-item similarity matrices and generates top-K recommendations using Spark SQL, with an optional C++ accelerated prediction path.

Description

The SARPlus class provides a scalable implementation of the SAR recommendation algorithm built on top of Apache Spark. It supports three similarity metrics for computing item-item relationships: cooccurrence, Jaccard, and lift. The algorithm works by first computing a user-item affinity matrix from interaction data, then deriving item-item similarity from co-occurrence patterns among users.
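To make the three metrics concrete, the sketch below computes them in plain Python from per-user item sets. It assumes the usual SAR definitions, where c_ij is the number of users who interacted with both items: Jaccard is c_ij / (c_ii + c_jj - c_ij) and lift is c_ij / (c_ii * c_jj). The function names are hypothetical, and this is an illustration rather than the Spark SQL that SARPlus executes.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(interactions):
    """Count, for each item pair, how many users interacted with both items.

    interactions: iterable of (user, item) pairs.
    Returns a dict keyed by (item_i, item_j); the diagonal (i, i) holds
    the number of distinct users who interacted with item i.
    """
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        for item in items:
            counts[(item, item)] += 1
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return dict(counts)


def item_similarity(counts, i, j, similarity_type="jaccard"):
    """Similarity between items i and j under the assumed SAR definitions."""
    c_ij = counts.get((i, j), 0)
    c_ii = counts[(i, i)]
    c_jj = counts[(j, j)]
    if similarity_type == "cooccurrence":
        return float(c_ij)
    if similarity_type == "jaccard":
        return c_ij / (c_ii + c_jj - c_ij)
    if similarity_type == "lift":
        return c_ij / (c_ii * c_jj)
    raise ValueError(f"unknown similarity_type: {similarity_type}")
```

Jaccard and lift both normalize the raw count by item frequency, so very popular items do not dominate every similarity list the way they can under plain co-occurrence.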

The fit process constructs an item co-occurrence matrix using Spark SQL self-joins on the training data, filtered by a configurable occurrence threshold. When time decay is enabled, ratings are weighted using an exponential decay function based on the elapsed time since each interaction, controlled by a half-life parameter specified in days.
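The time decay weighting can be sketched as follows. This is a minimal illustration, assuming timestamps are Unix seconds, that the reference time defaults to the latest timestamp in the data, and that repeated interactions for the same user and item are summed; `decayed_affinity` is a hypothetical helper, not part of the SARPlus API.

```python
SECONDS_PER_DAY = 24 * 3600


def decayed_affinity(events, half_life_days=30, time_now=None):
    """Aggregate ratings into user-item affinities with half-life decay.

    events: iterable of (user, item, rating, unix_timestamp_seconds).
    A rating loses half of its weight every `half_life_days` days.
    """
    events = list(events)
    if time_now is None:
        # Assumption: fall back to the most recent timestamp in the data
        time_now = max(ts for _, _, _, ts in events)

    half_life_seconds = half_life_days * SECONDS_PER_DAY
    affinity = {}
    for user, item, rating, ts in events:
        # Equivalent to exp(-ln(2) * elapsed / half_life)
        weight = 0.5 ** ((time_now - ts) / half_life_seconds)
        key = (user, item)
        affinity[key] = affinity.get(key, 0.0) + rating * weight
    return affinity
```

With a 30-day half-life, a rating of 4.0 recorded 30 days before the reference time contributes 2.0, and one recorded 60 days before contributes 1.0.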

SARPlus offers two distinct prediction paths. The slow path computes recommendations entirely within Spark SQL by multiplying user affinity scores against the item similarity matrix and ranking results via window functions. The fast path uses a C++-backed prediction engine through a pandas UDF, which memory-maps the similarity matrix from disk for efficient access across Spark worker processes. The fast path requires cache_path to point to a cache directory on a local or distributed file system.
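The scoring step shared by both paths (affinity multiplied by similarity, then ranked) can be sketched in plain Python. This is a simplified single-machine illustration under assumed data structures; `recommend_top_k` is a hypothetical name, and the real slow path expresses the same idea with Spark SQL joins and window functions.

```python
import heapq
from collections import defaultdict


def recommend_top_k(user_affinity, item_similarity, top_k=10, remove_seen=True):
    """Score items per user as the sum of affinity * similarity.

    user_affinity: {user: {item: affinity}}
    item_similarity: {(item_i, item_j): similarity}, stored in both directions
    Returns {user: [(item, score), ...]} sorted by descending score.
    """
    # Index similar items by source item for fast lookup
    neighbors = defaultdict(list)
    for (i, j), sim in item_similarity.items():
        neighbors[i].append((j, sim))

    recommendations = {}
    for user, affinities in user_affinity.items():
        scores = defaultdict(float)
        for item, affinity in affinities.items():
            for other, sim in neighbors[item]:
                scores[other] += affinity * sim
        if remove_seen:
            # Drop items the user has already interacted with
            for item in affinities:
                scores.pop(item, None)
        recommendations[user] = heapq.nlargest(
            top_k, scores.items(), key=lambda kv: kv[1]
        )
    return recommendations
```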

Additional methods provide popularity-based top-K item retrieval using item frequency counts extracted from the diagonal of the co-occurrence matrix, and user similarity computation based on user affinity dot products.
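Both helpers reduce to simple operations on structures the model already holds. The sketch below is a plain Python illustration with hypothetical function names; excluding the target user from their own similarity list and breaking ties by identifier are assumptions of the sketch, not confirmed behavior of SARPlus.

```python
def popularity_top_k(cooccurrence, top_k=10):
    """Return the top_k (item, frequency) pairs from the matrix diagonal.

    cooccurrence: {(item_i, item_j): count}; entries with i == j hold
    the item frequency.
    """
    diagonal = [(i, count) for (i, j), count in cooccurrence.items() if i == j]
    # Sort by descending frequency, then by item for a deterministic order
    diagonal.sort(key=lambda kv: (-kv[1], kv[0]))
    return diagonal[:top_k]


def most_similar_users(user_affinity, user, top_k=10):
    """Rank other users by the dot product of their affinity vectors.

    user_affinity: {user: {item: affinity}}
    """
    target = user_affinity[user]
    scores = []
    for other, affinities in user_affinity.items():
        if other == user:
            continue  # assumption: a user is not listed as their own neighbor
        dot = sum(
            weight * affinities[item]
            for item, weight in target.items()
            if item in affinities
        )
        scores.append((other, dot))
    scores.sort(key=lambda kv: (-kv[1], kv[0]))
    return scores[:top_k]
```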

Usage

Use SARPlus when you need item-based collaborative filtering recommendations on large-scale datasets that require distributed processing with Apache Spark. It is particularly suited to scenarios where the interaction data is too large to fit in memory on a single node, and where the simplicity and interpretability of co-occurrence-based recommendations are preferred over deep learning approaches. The fast C++ prediction path is recommended for production workloads on Databricks or Azure Synapse.

Code Reference

Source Location

Repository: Recommenders
File: contrib/sarplus/python/pysarplus/SARPlus.py
Lines: 1-595

Signature

class SARPlus:
    def __init__(
        self,
        spark,
        col_user="userID",
        col_item="itemID",
        col_rating="rating",
        col_timestamp="timestamp",
        table_prefix="",
        similarity_type="jaccard",
        time_decay_coefficient=30,
        time_now=None,
        timedecay_formula=False,
        threshold=1,
        cache_path=None,
    ): ...

    def fit(self, df): ...

    def recommend_k_items(
        self,
        test,
        top_k=10,
        remove_seen=True,
        use_cache=False,
        n_user_prediction_partitions=200,
    ): ...

    def get_user_affinity(self, test): ...

    def get_topk_most_similar_users(self, test, user, top_k=10): ...

    def get_popularity_based_topk(self, top_k=10, items=True): ...

Import

from pysarplus.SARPlus import SARPlus

I/O Contract

Inputs

Name	Type	Required	Description
spark	pyspark.sql.SparkSession	Yes	Active Spark session used for executing SQL queries and managing temporary views
col_user	str	No	Column name for user identifiers in the input DataFrame (default: "userID")
col_item	str	No	Column name for item identifiers in the input DataFrame (default: "itemID")
col_rating	str	No	Column name for rating values in the input DataFrame (default: "rating")
col_timestamp	str	No	Column name for timestamps in the input DataFrame (default: "timestamp")
table_prefix	str	No	Prefix for generated Spark SQL temporary table names to avoid collisions
similarity_type	str	No	Similarity metric: "cooccurrence", "jaccard", or "lift" (default: "jaccard")
time_decay_coefficient	float	No	Half-life in days for time decay of ratings (default: 30)
time_now	int or None	No	Reference timestamp for time decay computation; if None, max timestamp in data is used
timedecay_formula	bool	No	Whether to apply time decay weighting to ratings (default: False)
threshold	int	No	Minimum co-occurrence count for item pairs to be included in similarity (default: 1)
cache_path	str or None	No	Local or DBFS path for caching similarity matrix for C++ fast predictions
df	pyspark.sql.DataFrame	Yes	Training DataFrame passed to fit(), containing user, item, rating, and optionally timestamp columns
test	pyspark.sql.DataFrame	Yes	Test DataFrame passed to recommend_k_items(), containing users to generate recommendations for
top_k	int	No	Number of top items to recommend per user (default: 10)
remove_seen	bool	No	Whether to exclude items each user has already interacted with in the training data (default: True)
use_cache	bool	No	Whether to use the C++ fast prediction path, which relies on cache_path (default: False)
n_user_prediction_partitions	int	No	Number of partitions used for user prediction (default: 200)

Outputs

Name	Type	Description
recommend_k_items return	pyspark.sql.DataFrame	DataFrame with columns for user ID, item ID, and prediction score for top-K recommendations
get_user_affinity return	pyspark.sql.DataFrame	DataFrame with user, item, and rating columns representing each test user's historical affinities
get_topk_most_similar_users return	pyspark.sql.DataFrame	DataFrame with user ID and similarity score for the top-K most similar users
get_popularity_based_topk return	pyspark.sql.DataFrame	DataFrame with item ID and frequency for the top-K most popular items

Usage Examples

Basic Usage

from pysarplus.SARPlus import SARPlus
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("SARPlusExample").getOrCreate()

# Create a SARPlus model with Jaccard similarity
model = SARPlus(
    spark,
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard",
    timedecay_formula=True,
    time_decay_coefficient=30,
    threshold=1,
)

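# Fit the model on training data; train_df is assumed to be an existing Spark
# DataFrame with userID, itemID, rating, and timestamp columns
model.fit(train_df)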

# Generate top-10 recommendations for test users (slow path)
recommendations = model.recommend_k_items(test_df, top_k=10, remove_seen=False)
recommendations.show()

Fast Prediction with C++ Cache

from pysarplus.SARPlus import SARPlus

# Initialize with a cache path for C++ accelerated predictions
# (reuses the `spark` session created in Basic Usage)
model = SARPlus(
    spark,
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    similarity_type="lift",
    cache_path="/tmp/sar_cache",
)

model.fit(train_df)

# Use the fast C++ backed prediction path
recommendations = model.recommend_k_items(
    test_df,
    top_k=10,
    remove_seen=True,
    use_cache=True,
    n_user_prediction_partitions=200,
)

Popularity and User Similarity

# After fitting, retrieve most popular items
popular_items = model.get_popularity_based_topk(top_k=20)
popular_items.show()

# Find users most similar to a target user
similar_users = model.get_topk_most_similar_users(test_df, user=42, top_k=5)
similar_users.show()
