Principle:Online ml River Factorization Machines
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| Machine Learning Factorization Machines | Online_Learning, Recommender_Systems, Feature_Interactions | 2026-02-08 18:00 GMT |
Overview
Factorization machines (FMs) are a family of models that capture pairwise (and higher-order) feature interactions by embedding each feature into a low-dimensional latent space. They generalize matrix factorization and polynomial regression, and are especially effective for sparse, high-dimensional data common in recommendation systems and click-through rate prediction.
Description
Linear models capture only the direct effect of each feature on the target. In many applications -- especially recommender systems and computational advertising -- the interactions between features are equally or more important. A naive approach of including all pairwise interaction terms creates parameters for features, which is infeasible for high-dimensional sparse data.
Factorization Machines solve this by representing each feature with a latent vector where . The interaction between features and is modeled as the dot product , reducing the number of interaction parameters from to .
Several variants extend the basic FM:
- FM (Factorization Machine): Standard second-order FM with shared latent vectors.
- FFM (Field-aware FM): Each feature has a separate latent vector for each field, capturing field-specific interaction patterns.
- FwFM (Field-weighted FM): Adds a field-pair weight matrix that scales interactions between fields.
- HOFM (Higher-Order FM): Generalizes to interactions of order 3 and beyond using ANOVA kernel decomposition.
Usage
Use factorization machines when:
- You have sparse, high-dimensional categorical features (e.g., user-item interactions).
- Pairwise feature interactions are important but explicit enumeration is infeasible.
- You need an online model that can update incrementally with each new observation.
- You are working on recommendation, click-through rate prediction, or similar tasks.
Theoretical Basis
Standard FM (Order 2)
The FM prediction for an input vector is:
hat{y}(x) = w_0 + sum_{i=1}^{d} w_i * x_i
+ sum_{i=1}^{d} sum_{j=i+1}^{d} <v_i, v_j> * x_i * x_j
where:
w_0 = global bias
w_i = weight for feature i
v_i in R^k = latent vector for feature i
<v_i, v_j> = sum_{f=1}^{k} v_{i,f} * v_{j,f}
The key computational trick is that the pairwise sum can be computed in time:
sum_{i<j} <v_i, v_j> * x_i * x_j
= (1/2) * sum_{f=1}^{k} [ (sum_i v_{i,f} * x_i)^2 - sum_i v_{i,f}^2 * x_i^2 ]
Field-Aware FM (FFM)
In FFM, each feature belonging to field has a separate latent vector for each other field:
hat{y}(x) = w_0 + sum_i w_i * x_i
+ sum_{i<j} <v_{i,f_j}, v_{j,f_i}> * x_i * x_j
This allows the model to learn that feature interacts differently with features from different fields.
Online Learning
All FM variants are trained incrementally using stochastic gradient descent. For each observation , the gradient of the loss with respect to each parameter is computed and a single gradient step is taken, making FMs naturally suited to the online learning setting.