Implementation: Recommenders PySpark ALS
Knowledge Sources
| Field | Value |
|---|---|
| Domains | Matrix Factorization, Recommendation Systems, Distributed Computing |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for training an Alternating Least Squares matrix factorization model using PySpark's distributed ML library, producing a fitted ALSModel that can generate rating predictions.
Description
This is a wrapper document for PySpark's external pyspark.ml.recommendation.ALS API, documenting how it is used within the Recommenders repository's ALS workflow. The ALS class is a Spark ML Estimator that implements distributed ALS matrix factorization. It is configured with hyperparameters (rank, regularization, iterations) and column mappings (user, item, rating columns), then fitted on a training DataFrame to produce an ALSModel. The model contains the learned user and item factor matrices and can be used to predict ratings for arbitrary user-item pairs.
In the Recommenders context, ALS is typically configured with coldStartStrategy="drop" to handle users or items that appear in the test set but not the training set. This prevents NaN predictions from corrupting downstream evaluation metrics.
Usage
Instantiate the ALS estimator after splitting the data into train/test sets. Call .fit(train_df) to train the model. The returned ALSModel is then used with .transform(test_df) to generate predictions for evaluation.
Code Reference
Source Location
- Repository: External PySpark API
- Package: pyspark.ml.recommendation
Signature
# Estimator configuration
als = ALS(
    rank=10,
    maxIter=15,
    regParam=0.05,
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    coldStartStrategy="drop",
)

# Model training
model = als.fit(train_df)  # Returns ALSModel
Import
from pyspark.ml.recommendation import ALS
External Reference
- Official Documentation: pyspark.ml.recommendation.ALS
- Spark ML Guide: Collaborative Filtering - Spark MLlib
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| rank | int | No (PySpark default: 10) | Number of latent factors (dimensionality of the user and item vectors) |
| maxIter | int | No (PySpark default: 10; 15 in these examples) | Maximum number of ALS iterations (alternations between solving for U with V fixed and vice versa) |
| regParam | float | No (PySpark default: 0.1; 0.05 in these examples) | Regularization parameter (lambda) to prevent overfitting |
| userCol | str | No (PySpark default: "user"; "userID" in these examples) | Name of the column containing user identifiers |
| itemCol | str | No (PySpark default: "item"; "itemID" in these examples) | Name of the column containing item identifiers |
| ratingCol | str | No (default: "rating") | Name of the column containing ratings or interaction values |
| coldStartStrategy | str | No (PySpark default: "nan"; "drop" in this workflow) | Strategy for handling users/items unseen at training time; "drop" removes rows with NaN predictions, "nan" keeps them |
| implicitPrefs | bool | No (default: False) | If True, uses implicit-feedback ALS with confidence weighting |
| alpha | float | No (default: 1.0) | Confidence scaling parameter for implicit feedback (only used when implicitPrefs=True) |
| train_df | pyspark.sql.DataFrame | Yes (for .fit()) | Training DataFrame containing the user, item, and rating columns |
Outputs
| Name | Type | Description |
|---|---|---|
| model | pyspark.ml.recommendation.ALSModel | Fitted model containing the learned user factor matrix U and item factor matrix V; exposes .transform() for prediction and .userFactors / .itemFactors for accessing the latent vectors |
Usage Examples
Basic ALS Training
from pyspark.ml.recommendation import ALS
# Configure ALS estimator
als = ALS(
    rank=10,
    maxIter=15,
    regParam=0.05,
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    coldStartStrategy="drop",
)
# Train the model
model = als.fit(train_df)
# Access learned factor matrices
print(f"User factors shape: {model.userFactors.count()} users x {model.rank} factors")
print(f"Item factors shape: {model.itemFactors.count()} items x {model.rank} factors")
ALS with Implicit Feedback
from pyspark.ml.recommendation import ALS
als = ALS(
    rank=20,
    maxIter=15,
    regParam=0.1,
    userCol="userID",
    itemCol="itemID",
    ratingCol="count",
    implicitPrefs=True,
    alpha=40.0,
    coldStartStrategy="drop",
)
model = als.fit(implicit_train_df)
Full Workflow in Recommenders Context
from recommenders.utils.spark_utils import start_or_get_spark
from recommenders.datasets.movielens import load_spark_df
from recommenders.datasets.spark_splitters import spark_random_split
from pyspark.ml.recommendation import ALS
# Setup
spark = start_or_get_spark(app_name="ALS_Example", memory="16g")
data = load_spark_df(spark, size="100k")
train, test = spark_random_split(data, ratio=0.75, seed=42)
# Train ALS model
als = ALS(
    rank=10,
    maxIter=15,
    regParam=0.05,
    userCol="userID",
    itemCol="itemID",
    ratingCol="rating",
    coldStartStrategy="drop",
)
model = als.fit(train)