
Heuristic:Fastai Fastbook Embedding Size Rule

From Leeroopedia





Knowledge Sources

  • Domains: Collaborative_Filtering, Tabular
  • Last Updated: 2026-02-09 17:00 GMT

Overview

Use `get_emb_sz()` to determine embedding dimensions automatically from category cardinality; in the neural-network collaborative filtering model, user and item embeddings can have different sizes.

Description

When using embeddings to represent categorical variables (users, items, categories), the embedding dimension must be chosen. Too small an embedding cannot capture the underlying patterns; too large wastes memory and risks overfitting. The fastai library provides `get_emb_sz()`, which returns recommended embedding matrix sizes based on a heuristic derived from empirical experimentation. For the MovieLens dataset example, this produces embedding sizes of 74 for users (944 categories) and 101 for movies (1635 categories).

In the dot-product collaborative filtering model, both embeddings must have the same dimension (since they are multiplied together). In the neural network model, embeddings can have different dimensions because they are concatenated instead of multiplied.
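A toy illustration (plain Python with hypothetical vector values, not fastai code) of why the dot-product model forces equal embedding sizes while concatenation does not:

```python
# Hypothetical toy embeddings: 3 user factors, 3 or 4 item factors.
user_vec = [0.2, -0.1, 0.5]
item_vec_same = [0.4, 0.3, -0.2]        # dot product requires matching length
item_vec_diff = [0.4, 0.3, -0.2, 0.1]   # concatenation tolerates any length

# Dot-product model: elementwise multiply then sum (lengths must match).
dot_score = sum(u * i for u, i in zip(user_vec, item_vec_same))

# Neural-net model: concatenate and feed the combined vector to an MLP.
mlp_input = user_vec + item_vec_diff    # 3 + 4 = 7 input features
```

This is why `get_emb_sz()`'s asymmetric recommendations are usable as-is only in the neural-network variant.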

Usage

Use this heuristic when:

  • Building collaborative filtering models: To determine user and item embedding sizes
  • Building tabular deep learning models: To determine embedding sizes for categorical features
  • Choosing between dot-product and NN models: NN models can use asymmetric embedding sizes from `get_emb_sz()`

The Insight (Rule of Thumb)

  • Action: Call `get_emb_sz(dls)` to get recommended embedding dimensions.
  • Value: The function returns tuples of `(n_categories, embedding_dim)`. Example output for MovieLens: `[(944, 74), (1635, 101)]` — 74 latent factors for 944 users, 101 latent factors for 1635 movies.
  • Trade-off: Larger embeddings can capture more nuance but increase model size and overfitting risk. The `get_emb_sz` heuristic balances expressiveness with regularization.
  • For dot-product models: Both embeddings must have the same size (e.g., 50), so `get_emb_sz` output needs to be overridden.
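A minimal sketch (plain Python; the shared size of 50 follows the fastbook dot-product example) of overriding the per-variable recommendations with one common dimension:

```python
# Recommended (n_categories, emb_dim) pairs from get_emb_sz, as quoted above.
recommended = [(944, 74), (1635, 101)]

# A dot-product model multiplies the two embeddings elementwise, so both
# must share one dimension; here we override with a hand-picked size of 50.
shared_dim = 50
dot_product_sizes = [(n_cat, shared_dim) for n_cat, _ in recommended]
print(dot_product_sizes)  # [(944, 50), (1635, 50)]
```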

Reasoning

The embedding size should scale sub-linearly with the number of categories. A category with 1000 unique values does not need 1000-dimensional embeddings — much of the information is redundant. The fastai heuristic has been empirically validated across many datasets and produces embedding sizes that work well in practice. The key principle is that the embedding dimension represents the number of "latent factors" the model can learn about each entity.
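The heuristic itself is compact. A standalone re-implementation (mirroring fastai's `emb_sz_rule`, which in recent versions grows sub-linearly with exponent 0.56 and caps the size at 600) reproduces the MovieLens sizes quoted above:

```python
# Re-implementation of fastai's emb_sz_rule heuristic: embedding size
# scales as ~1.6 * n_cat**0.56, capped at 600 dimensions.
def emb_sz_rule(n_cat: int) -> int:
    return min(600, round(1.6 * n_cat ** 0.56))

print(emb_sz_rule(944))   # 74 latent factors for the 944 user categories
print(emb_sz_rule(1635))  # 101 latent factors for the 1635 movie categories
```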

Code Evidence

Embedding size heuristic from `08_collab.md:641-650`:

embs = get_emb_sz(dls)
embs
# Output: [(944, 74), (1635, 101)]

Neural collab model using asymmetric embeddings from `08_collab.md:655-668`:

class CollabNN(Module):
    def __init__(self, user_sz, item_sz, y_range=(0,5.5), n_act=100):
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1]+item_sz[1], n_act),
            nn.ReLU(),
            nn.Linear(n_act, 1))
        self.y_range = y_range

    def forward(self, x):
        # Look up both embeddings, concatenate them, and pass through the MLP;
        # sigmoid_range squashes the output into y_range.
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim=1))
        return sigmoid_range(x, *self.y_range)
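Given the MovieLens sizes above, the width of the first `nn.Linear` layer is just the sum of the two embedding dimensions; a quick arithmetic check (plain Python):

```python
# (n_categories, emb_dim) pairs as returned by get_emb_sz above.
user_sz, item_sz = (944, 74), (1635, 101)

# CollabNN concatenates the two embeddings, so the first linear layer
# receives user_sz[1] + item_sz[1] input features.
first_linear_in = user_sz[1] + item_sz[1]
print(first_linear_in)  # 175
```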
