Implementation:Recommenders team Recommenders SARPlus

From Leeroopedia
Revision as of 16:29, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Recommenders_team_Recommenders_SARPlus.md)


Knowledge Sources
Domains: Collaborative Filtering, Recommendation Systems, Spark
Last Updated: 2026-02-10 00:00 GMT

Overview

SARPlus is a PySpark implementation of the Simple Algorithm for Recommendation (SAR) that computes item-item similarity matrices and generates top-K recommendations using Spark SQL, with an optional C++ accelerated prediction path.

Description

The SARPlus class provides a scalable implementation of the SAR recommendation algorithm built on top of Apache Spark. It supports three similarity metrics for computing item-item relationships: cooccurrence, Jaccard, and lift. The algorithm works by first computing a user-item affinity matrix from interaction data, then deriving item-item similarity from co-occurrence patterns among users.
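To make the three metrics concrete, here is a minimal pure-Python sketch that uses the formulas commonly given for SAR: the raw count c_ij for cooccurrence, c_ij / (c_ii + c_jj - c_ij) for Jaccard, and c_ij / (c_ii * c_jj) for lift. The helper names `cooccurrence_counts` and `similarity` and the toy data are hypothetical; SARPlus itself computes these quantities in Spark SQL, not with this code.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(interactions):
    """Count the users shared by each item pair; (i, i) holds item i's user count."""
    items_by_user = defaultdict(set)
    for user, item in interactions:
        items_by_user[user].add(item)

    counts = defaultdict(int)
    for items in items_by_user.values():
        for item in items:
            counts[(item, item)] += 1
        for i, j in combinations(sorted(items), 2):
            counts[(i, j)] += 1
            counts[(j, i)] += 1  # keep the matrix symmetric
    return dict(counts)


def similarity(counts, i, j, kind="jaccard"):
    """Item-item similarity derived from co-occurrence counts (illustrative only)."""
    c_ij = counts.get((i, j), 0)
    if c_ij == 0:
        return 0.0
    c_ii, c_jj = counts[(i, i)], counts[(j, j)]
    if kind == "cooccurrence":
        return float(c_ij)
    if kind == "jaccard":
        return c_ij / (c_ii + c_jj - c_ij)
    if kind == "lift":
        return c_ij / (c_ii * c_jj)
    raise ValueError(f"unknown similarity type: {kind}")


# Hypothetical toy data: (user, item) pairs.
toy = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "B"), ("u3", "A"), ("u3", "C")]
counts = cooccurrence_counts(toy)
for kind in ("cooccurrence", "jaccard", "lift"):
    print(kind, similarity(counts, "A", "B", kind))
```

Cooccurrence favors popular items, lift discounts popularity most strongly, and Jaccard sits between the two.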

The fit process constructs an item co-occurrence matrix using Spark SQL self-joins on the training data and keeps only item pairs whose co-occurrence count meets a configurable minimum (threshold). When time decay is enabled (timedecay_formula=True), ratings are weighted by an exponential decay function of the time elapsed since each interaction, with the half-life in days set by time_decay_coefficient.
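The half-life weighting can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the SARPlus code: timestamps are taken to be Unix epoch seconds, repeated interactions for the same user and item are summed, and the function names `decayed_rating` and `user_item_affinity` are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def decayed_rating(rating, timestamp, time_now, half_life_days=30):
    """Halve the rating's weight for every `half_life_days` of elapsed time."""
    elapsed_days = (time_now - timestamp) / SECONDS_PER_DAY
    return rating * math.exp(-math.log(2) * elapsed_days / half_life_days)


def user_item_affinity(events, time_now, half_life_days=30):
    """Sum decayed ratings per (user, item); summing is an assumption of this sketch."""
    affinity = {}
    for user, item, rating, timestamp in events:
        key = (user, item)
        weight = decayed_rating(rating, timestamp, time_now, half_life_days)
        affinity[key] = affinity.get(key, 0.0) + weight
    return affinity


# Hypothetical events: (user, item, rating, timestamp in epoch seconds).
now = 100 * SECONDS_PER_DAY
events = [
    ("u1", "A", 4.0, now),                         # just happened: full weight
    ("u1", "B", 4.0, now - 30 * SECONDS_PER_DAY),  # one half-life ago: half weight
]
affinity = user_item_affinity(events, time_now=now)
print(affinity)
```

A shorter half-life makes the model react faster to recent behavior at the cost of discarding older signal.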

SARPlus offers two prediction paths, selected by the use_cache argument of recommend_k_items. The slow path (use_cache=False, the default) computes recommendations entirely within Spark SQL by multiplying user affinity scores against the item similarity matrix and ranking results via window functions. The fast path (use_cache=True) uses a C++-backed prediction engine through a pandas UDF, which memory-maps the similarity matrix from disk for efficient access across Spark worker processes. The fast path requires cache_path to point to a local or distributed file system directory.
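The scoring that both paths perform, multiplying a user's affinities by the item similarity matrix and keeping the best-ranked items, can be sketched for a single user in pure Python. The function `recommend_top_k` and the sample values are hypothetical; the real slow path expresses the same idea as Spark SQL joins and window functions.

```python
def recommend_top_k(user_affinity, item_similarity, top_k=10, remove_seen=True):
    """Score items as sum_i affinity[i] * similarity[i, j] and return the top-k.

    user_affinity: {item: weight} for one user.
    item_similarity: {(item_i, item_j): similarity}, stored symmetrically.
    """
    scores = {}
    for (i, j), sim in item_similarity.items():
        if i in user_affinity:
            scores[j] = scores.get(j, 0.0) + user_affinity[i] * sim

    if remove_seen:
        # Drop items the user has already interacted with.
        scores = {item: s for item, s in scores.items() if item not in user_affinity}

    # Highest score first; ties broken by item id for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical data for one user who interacted with items A and B.
affinity = {"A": 5.0, "B": 1.0}
sim = {
    ("A", "A"): 1.0, ("B", "B"): 1.0, ("C", "C"): 1.0, ("D", "D"): 1.0,
    ("A", "B"): 0.5, ("B", "A"): 0.5,
    ("A", "C"): 0.4, ("C", "A"): 0.4,
    ("B", "D"): 0.9, ("D", "B"): 0.9,
}
unseen = recommend_top_k(affinity, sim, top_k=2)
everything = recommend_top_k(affinity, sim, top_k=4, remove_seen=False)
print(unseen)
print(everything)
```

Item C outranks D here even though D has the higher similarity value, because C is tied to the item the user rated most strongly.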

Additional methods provide popularity-based top-K item retrieval using item frequency counts extracted from the diagonal of the co-occurrence matrix, and user similarity computation based on user affinity dot products.
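Both ideas fit in a few lines of plain Python. The sketch below reads item frequencies from the diagonal of a co-occurrence dictionary and ranks users by the dot product of their affinity vectors. All function names are hypothetical, and excluding the target user from its own neighbor list is an assumption of this sketch, not a documented SARPlus behavior.

```python
def popularity_top_k(cooccurrence, top_k=10):
    """Diagonal entries (i, i) hold each item's frequency; return the largest."""
    diagonal = {i: count for (i, j), count in cooccurrence.items() if i == j}
    return sorted(diagonal.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


def user_similarity(affinity_a, affinity_b):
    """Dot product of two sparse affinity vectors stored as {item: weight}."""
    shared_items = affinity_a.keys() & affinity_b.keys()
    return sum(affinity_a[item] * affinity_b[item] for item in shared_items)


def most_similar_users(affinities, user, top_k=10):
    """Rank all other users by affinity dot product with `user`."""
    target = affinities[user]
    scored = [
        (other, user_similarity(target, vector))
        for other, vector in affinities.items()
        if other != user  # assumption: the target user is excluded
    ]
    return sorted(scored, key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical co-occurrence counts and per-user affinities.
cooccurrence = {("A", "A"): 3, ("B", "B"): 2, ("C", "C"): 1, ("A", "B"): 2, ("B", "A"): 2}
affinities = {
    "u1": {"A": 1.0, "B": 2.0},
    "u2": {"A": 3.0},
    "u3": {"B": 1.0, "C": 4.0},
}
popular = popularity_top_k(cooccurrence, top_k=2)
neighbors = most_similar_users(affinities, "u1", top_k=2)
print(popular)
print(neighbors)
```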

Usage

Use SARPlus when you need item-based collaborative filtering recommendations on datasets large enough to require distributed processing with Apache Spark. It is particularly suited to scenarios where the interaction data does not fit in memory on a single node, and where the simplicity and interpretability of co-occurrence-based recommendations are preferred over deep learning approaches. The fast C++ prediction path is recommended for production workloads on Databricks or Azure Synapse.

Code Reference

Source Location

Signature

class SARPlus:
    def __init__(
        self,
        spark,
        col_user="userID",
        col_item="itemID",
        col_rating="rating",
        col_timestamp="timestamp",
        table_prefix="",
        similarity_type="jaccard",
        time_decay_coefficient=30,
        time_now=None,
        timedecay_formula=False,
        threshold=1,
        cache_path=None,
    ): ...

    def fit(self, df): ...

    def recommend_k_items(
        self,
        test,
        top_k=10,
        remove_seen=True,
        use_cache=False,
        n_user_prediction_partitions=200,
    ): ...

    def get_user_affinity(self, test): ...

    def get_topk_most_similar_users(self, test, user, top_k=10): ...

    def get_popularity_based_topk(self, top_k=10, items=True): ...

Import

from pysarplus.SARPlus import SARPlus

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| spark | pyspark.sql.SparkSession | Yes | Active Spark session used for executing SQL queries and managing temporary views |
| col_user | str | No | Column name for user identifiers in the input DataFrame (default: "userID") |
| col_item | str | No | Column name for item identifiers in the input DataFrame (default: "itemID") |
| col_rating | str | No | Column name for rating values in the input DataFrame (default: "rating") |
| col_timestamp | str | No | Column name for timestamps in the input DataFrame (default: "timestamp") |
| table_prefix | str | No | Prefix for generated Spark SQL temporary table names to avoid collisions |
| similarity_type | str | No | Similarity metric: "cooccurrence", "jaccard", or "lift" (default: "jaccard") |
| time_decay_coefficient | float | No | Half-life in days for time decay of ratings (default: 30) |
| time_now | int or None | No | Reference timestamp for time decay computation; if None, max timestamp in data is used |
| timedecay_formula | bool | No | Whether to apply time decay weighting to ratings (default: False) |
| threshold | int | No | Minimum co-occurrence count for item pairs to be included in similarity (default: 1) |
| cache_path | str or None | No | Local or DBFS path for caching similarity matrix for C++ fast predictions |
| df | pyspark.sql.DataFrame | Yes | Training DataFrame passed to fit(), containing user, item, rating, and optionally timestamp columns |
| test | pyspark.sql.DataFrame | Yes | Test DataFrame passed to recommend_k_items(), containing users to generate recommendations for |
| top_k | int | No | Number of top items to recommend per user (default: 10) |

Outputs

| Name | Type | Description |
|---|---|---|
| recommend_k_items return | pyspark.sql.DataFrame | DataFrame with columns for user ID, item ID, and prediction score for top-K recommendations |
| get_user_affinity return | pyspark.sql.DataFrame | DataFrame with user, item, and rating columns representing each test user's historical affinities |
| get_topk_most_similar_users return | pyspark.sql.DataFrame | DataFrame with user ID and similarity score for the top-K most similar users |
| get_popularity_based_topk return | pyspark.sql.DataFrame | DataFrame with item ID and frequency for the top-K most popular items |

Usage Examples

Basic Usage

from pysarplus.SARPlus import SARPlus
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("SARPlusExample").getOrCreate()

# Create a SARPlus model with Jaccard similarity
model = SARPlus(
    spark,
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard",
    timedecay_formula=True,
    time_decay_coefficient=30,
    threshold=1,
)

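# Fit the model on training data.
# train_df and test_df are assumed to be existing Spark DataFrames with
# userID, itemID, rating, and timestamp columns; they are not created here.
model.fit(train_df)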

# Generate top-10 recommendations for test users (slow path)
recommendations = model.recommend_k_items(test_df, top_k=10, remove_seen=False)
recommendations.show()

Fast Prediction with C++ Cache

from pysarplus.SARPlus import SARPlus

# Initialize with a cache path for C++ accelerated predictions.
# `spark`, `train_df`, and `test_df` are assumed to exist already
# (see Basic Usage).
model = SARPlus(
    spark,
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    similarity_type="lift",
    cache_path="/tmp/sar_cache",
)

model.fit(train_df)

# Use the fast C++ backed prediction path
recommendations = model.recommend_k_items(
    test_df,
    top_k=10,
    remove_seen=True,
    use_cache=True,
    n_user_prediction_partitions=200,
)

Popularity and User Similarity

# After fitting, retrieve most popular items
popular_items = model.get_popularity_based_topk(top_k=20)
popular_items.show()

# Find users most similar to a target user
similar_users = model.get_topk_most_similar_users(test_df, user=42, top_k=5)
similar_users.show()
