Principle:Online ml River Factorization Machines

Knowledge Sources	Domains	Last Updated
Machine Learning Factorization Machines	Online_Learning, Recommender_Systems, Feature_Interactions	2026-02-08 18:00 GMT

Overview

Factorization machines (FMs) are a family of models that capture pairwise (and higher-order) feature interactions by embedding each feature into a low-dimensional latent space. They generalize matrix factorization and polynomial regression, and are especially effective for sparse, high-dimensional data common in recommendation systems and click-through rate prediction.

Description

Linear models capture only the direct effect of each feature on the target. In many applications -- especially recommender systems and computational advertising -- the interactions between features are equally or more important. A naive approach of including all pairwise interaction terms creates $O (d^{2})$ parameters for $d$ features, which is infeasible for high-dimensional sparse data.

Factorization Machines solve this by representing each feature $i$ with a latent vector $v_{i} \in ℝ^{k}$ where $k ≪ d$ . The interaction between features $i$ and $j$ is modeled as the dot product $⟨ v_{i}, v_{j} ⟩$ , reducing the number of interaction parameters from $O (d^{2})$ to $O (d k)$ .

Several variants extend the basic FM:

FM (Factorization Machine): Standard second-order FM with shared latent vectors.
FFM (Field-aware FM): Each feature has a separate latent vector for each field, capturing field-specific interaction patterns.
FwFM (Field-weighted FM): Adds a field-pair weight matrix that scales interactions between fields.
HOFM (Higher-Order FM): Generalizes to interactions of order 3 and beyond using ANOVA kernel decomposition.

Usage

Use factorization machines when:

You have sparse, high-dimensional categorical features (e.g., user-item interactions).
Pairwise feature interactions are important but explicit enumeration is infeasible.
You need an online model that can update incrementally with each new observation.
You are working on recommendation, click-through rate prediction, or similar tasks.

Theoretical Basis

Standard FM (Order 2)

The FM prediction for an input vector $x$ is:

hat{y}(x) = w_0 + sum_{i=1}^{d} w_i * x_i
             + sum_{i=1}^{d} sum_{j=i+1}^{d} <v_i, v_j> * x_i * x_j

where:
  w_0 = global bias
  w_i = weight for feature i
  v_i in R^k = latent vector for feature i
  <v_i, v_j> = sum_{f=1}^{k} v_{i,f} * v_{j,f}

The key computational trick is that the pairwise sum can be computed in $O (d k)$ time:

sum_{i<j} <v_i, v_j> * x_i * x_j
  = (1/2) * sum_{f=1}^{k} [ (sum_i v_{i,f} * x_i)^2 - sum_i v_{i,f}^2 * x_i^2 ]

Field-Aware FM (FFM)

In FFM, each feature $i$ belonging to field $f_{i}$ has a separate latent vector for each other field:

hat{y}(x) = w_0 + sum_i w_i * x_i
             + sum_{i<j} <v_{i,f_j}, v_{j,f_i}> * x_i * x_j

This allows the model to learn that feature $i$ interacts differently with features from different fields.

Online Learning

All FM variants are trained incrementally using stochastic gradient descent. For each observation $(x, y)$ , the gradient of the loss with respect to each parameter is computed and a single gradient step is taken, making FMs naturally suited to the online learning setting.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment