Principle:Pytorch Serve Recommendation Model Serving

From Leeroopedia
Field Value
source Pytorch_Serve
domains Recommendation, Distributed_Computing
last_updated 2026-02-13 18:52 GMT

Overview

Recommendation Model Serving is the principle of deploying deep learning recommendation models (DLRMs) that perform sparse embedding lookups, distributed sharding of embedding tables, and dense feature interaction computations to generate personalized recommendations at serving time.

Description

This principle covers what serving deep learning recommendation models in production involves. Recommendation models are architecturally distinct from other deep learning models because they process both sparse categorical features (user IDs, item IDs, categories) and dense numerical features (age, price, click-through rates) through separate pathways that are combined via feature interaction layers.

The key components of DLRM serving include:

  • Embedding tables -- Large lookup tables that map sparse categorical features to dense vector representations. These tables can consume hundreds of gigabytes of memory, requiring distributed sharding across multiple devices or hosts.
  • Bottom MLP -- A multi-layer perceptron that processes dense numerical features into a representation of the same dimensionality as the embedding vectors.
  • Feature interaction layer -- Computes pairwise interactions (typically dot products) between all embedding vectors and the dense feature representation.
  • Top MLP -- Processes the concatenated interaction features and produces the final prediction (e.g., click-through probability).
  • Factory pattern -- A DLRMFactory abstraction creates and configures DLRM instances with appropriate sharding strategies based on available hardware.
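
The components above map onto TorchRec as follows:

from torchrec.models.dlrm import DLRM
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection
# Used only when the model is sharded (see the note at the end)
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner

# Describe the embedding tables: one per sparse categorical feature
embedding_config = [
    EmbeddingBagConfig(
        name="user_embedding",
        embedding_dim=64,
        num_embeddings=10_000_000,
        feature_names=["user_id"],
    ),
    EmbeddingBagConfig(
        name="item_embedding",
        embedding_dim=64,
        num_embeddings=1_000_000,
        feature_names=["item_id"],
    ),
]

# Build the (not yet sharded) DLRM
model = DLRM(
    embedding_bag_collection=EmbeddingBagCollection(tables=embedding_config),
    dense_in_features=13,
    dense_arch_layer_sizes=[512, 256, 64],  # bottom MLP; last size must equal embedding_dim
    over_arch_layer_sizes=[512, 256, 1],    # top MLP; one output per example
)

# Sharding is applied separately: wrap the model in DistributedModelParallel,
# optionally with a plan produced by EmbeddingShardingPlanner. This requires
# an initialized torch.distributed process group.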
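
The factory pattern listed above can be sketched without TorchRec. The names `DLRMSpec`, `DLRMFactory`, and `describe` below are hypothetical and are not taken from the Pytorch_Serve source; the sketch only shows how a factory might validate a configuration and pick a device for the serving environment, with the actual model construction delegated to a builder callable.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass(frozen=True)
class DLRMSpec:
    """Hypothetical container for the DLRM hyperparameters used above."""
    embedding_dim: int
    dense_in_features: int
    dense_arch_layer_sizes: List[int]
    over_arch_layer_sizes: List[int]

    def __post_init__(self) -> None:
        # The bottom MLP output must match the embedding dimension so that
        # dense and sparse vectors can be combined by dot products.
        if self.dense_arch_layer_sizes[-1] != self.embedding_dim:
            raise ValueError("last dense layer size must equal embedding_dim")


class DLRMFactory:
    """Hypothetical factory: chooses a device, then delegates to a builder."""

    def __init__(self, builder: Callable[[DLRMSpec, str], Any]) -> None:
        self._builder = builder

    def create(self, spec: DLRMSpec, num_gpus: int) -> Any:
        device = "cuda" if num_gpus > 0 else "cpu"
        return self._builder(spec, device)


def describe(spec: DLRMSpec, device: str) -> dict:
    """Stand-in builder; a real one would construct the model on `device`."""
    return {"device": device, "top_mlp_out": spec.over_arch_layer_sizes[-1]}


spec = DLRMSpec(64, 13, [512, 256, 64], [512, 256, 1])
factory = DLRMFactory(describe)
gpu_model = factory.create(spec, num_gpus=2)
cpu_model = factory.create(spec, num_gpus=0)
```

Passing the builder in as a callable keeps the environment-specific decision (here, only the device string) separate from model construction, so the same factory can be exercised in tests without GPUs.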

Usage

Apply this principle when:

  • The model contains large embedding tables that exceed the memory capacity of a single device.
  • Personalized ranking or click-through-rate prediction is required at serving time.
  • The input data contains a mix of sparse categorical and dense numerical features.
  • Low-latency inference is critical (e.g., real-time ad ranking, product recommendation).
  • The deployment requires model-parallel distribution of embedding tables across GPUs or hosts via TorchRec's sharding primitives.
  • A factory abstraction is needed to instantiate models with different configurations for different serving environments.

Theoretical Basis

The Deep Learning Recommendation Model (DLRM) architecture processes heterogeneous features through two parallel pathways, one sparse and one dense, whose outputs are combined by a feature interaction layer:

Sparse feature processing:

  1. Each categorical feature c_i is mapped to a dense vector via an embedding table lookup: e_i = E_i[c_i].
  2. For multi-valued features, an EmbeddingBag operation performs lookup and pooling (sum or mean) in a single fused operation.
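
The lookup-and-pool behaviour described in these two steps can be illustrated in plain Python. This is a minimal sketch, not the fused kernel a real EmbeddingBag uses; the function name `embedding_bag` and the toy table are assumptions made for illustration.

```python
from typing import List, Sequence


def embedding_bag(
    table: Sequence[Sequence[float]], ids: Sequence[int], mode: str = "sum"
) -> List[float]:
    """Look up the rows for `ids` and pool them into a single vector."""
    if mode not in ("sum", "mean"):
        raise ValueError("mode must be 'sum' or 'mean'")
    dim = len(table[0])
    pooled = [0.0] * dim
    for i in ids:
        row = table[i]
        for d in range(dim):
            pooled[d] += row[d]
    if mode == "mean" and ids:
        pooled = [v / len(ids) for v in pooled]
    return pooled


# Toy embedding table: 4 rows, embedding dimension 2
table = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [4.0, 1.0]]

single = embedding_bag(table, [2])                         # e_i = E_i[c_i]
multi_sum = embedding_bag(table, [0, 1, 3])                # sum pooling
multi_mean = embedding_bag(table, [0, 1, 3], mode="mean")  # mean pooling
```

A single-valued feature is the special case of a bag with one ID, which is why both cases can share one operation.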

Dense feature processing:

  1. Dense features x = [x_1, x_2, ..., x_d] are passed through the bottom MLP: z = MLP_bottom(x).
  2. The output dimensionality matches the embedding dimension, so that z can enter the dot products of the interaction step alongside the embedding vectors.

Feature interaction:

  1. All vectors [z, e_1, e_2, ..., e_k] are collected into a matrix.
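  2. Pairwise dot products compute interactions: I_{ij} = v_i^T v_j, where v_i is the i-th vector in that matrix.
  3. Because I is symmetric, only the upper triangle is retained (typically the entries with i < j, excluding the diagonal), which avoids redundant features.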
  4. The interaction features are concatenated with the dense representation.
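
The four interaction steps can be sketched with plain Python lists. The sketch assumes the diagonal is excluded (only pairs with i < j are kept); the helper names `dot` and `interact` and the toy vectors are illustrative only.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def interact(z: Sequence[float], embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Return concat(z, I), where I holds the pairwise dot products with i < j."""
    vectors = [z, *embeddings]  # step 1: collect [z, e_1, ..., e_k]
    n = len(vectors)
    pairs = [
        dot(vectors[i], vectors[j])  # step 2: I_ij = v_i^T v_j
        for i in range(n)
        for j in range(i + 1, n)     # step 3: strictly upper triangle
    ]
    return list(z) + pairs           # step 4: concatenate with z


z = [1.0, 2.0]    # bottom MLP output
e1 = [0.5, 0.5]   # embedding of the first sparse feature
e2 = [2.0, -1.0]  # embedding of the second sparse feature
features = interact(z, [e1, e2])
```

With three vectors there are three pairs (z with e1, z with e2, e1 with e2), so `features` has length 2 + 3 = 5.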

Prediction:

  1. The top MLP processes the concatenated features: y = sigmoid(MLP_top(concat(z, I))).
  2. The output is a probability score used for ranking.
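
As a worked example of the prediction step, the sketch below computes the input width of the top MLP and turns raw scores into a ranking. It assumes the diagonal of the interaction matrix is excluded; the logit values are hypothetical stand-ins for the output of MLP_top, and the function names are illustrative.

```python
import math


def top_mlp_input_dim(embedding_dim: int, num_sparse_features: int) -> int:
    """Width of concat(z, I) when only pairs with i < j are kept."""
    n = num_sparse_features + 1  # k embedding vectors plus z
    return embedding_dim + n * (n - 1) // 2


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


# Two sparse features with embedding_dim=64, as in the tables configured above
input_dim = top_mlp_input_dim(embedding_dim=64, num_sparse_features=2)

# Hypothetical top-MLP outputs (logits) for three candidate items
logits = {"item_a": 1.2, "item_b": -0.3, "item_c": 0.4}
scores = {item: sigmoid(v) for item, v in logits.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the sigmoid is monotonic, ranking by probability gives the same order as ranking by logit; the sigmoid matters when a calibrated click-through probability is needed.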

Embedding table sharding distributes tables across devices using strategies including:

  • Table-wise sharding -- Each table resides on a single device.
  • Row-wise sharding -- Rows of a single table are split across devices.
  • Column-wise sharding -- Embedding dimensions are split across devices.

An automatic planner (in TorchRec, EmbeddingShardingPlanner) selects a strategy for each table based on table sizes and device memory constraints.
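
The three strategies differ only in what is partitioned. The sketch below illustrates them for the two tables configured earlier. It assumes 4 bytes per value (fp32) and uses a simple greedy placement for table-wise sharding; that heuristic is an illustration and is not the algorithm TorchRec's planner uses. The names `split_evenly` and `table_wise` are hypothetical.

```python
from typing import Dict, List


def split_evenly(total: int, parts: int) -> List[range]:
    """Split range(total) into `parts` contiguous, near-equal ranges."""
    base, extra = divmod(total, parts)
    shards, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


def table_wise(table_bytes: Dict[str, int], num_devices: int) -> Dict[str, int]:
    """Place each whole table on one device: largest first, least-loaded device."""
    loads = [0] * num_devices
    placement: Dict[str, int] = {}
    for name in sorted(table_bytes, key=table_bytes.get, reverse=True):
        device = loads.index(min(loads))
        placement[name] = device
        loads[device] += table_bytes[name]
    return placement


BYTES_PER_VALUE = 4  # assumption: fp32 embeddings
tables = {"user_embedding": (10_000_000, 64), "item_embedding": (1_000_000, 64)}
sizes = {name: rows * dim * BYTES_PER_VALUE for name, (rows, dim) in tables.items()}

placement = table_wise(sizes, num_devices=2)  # table-wise: table -> device
row_shards = split_evenly(10_000_000, 4)      # row-wise: row ranges per device
col_shards = split_evenly(64, 4)              # column-wise: dimension ranges per device
```

Here the user table alone is about 2.56 GB, ten times the item table, so table-wise placement leaves the two devices unevenly loaded; row-wise or column-wise sharding of the larger table spreads that memory more evenly at the cost of extra communication during lookup.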
