Implementation:Recommenders team Recommenders EmbDotBias DataLoader

From Leeroopedia


Knowledge Sources
Domains: Collaborative Filtering, Data Loading, PyTorch
Last Updated: 2026-02-10 00:00 GMT

Overview

Provides PyTorch Dataset and DataLoader classes for loading, indexing, and batching user-item-rating data for the EmbeddingDotBias collaborative filtering model.

Description

This module contains two classes that form the data ingestion pipeline for the EmbeddingDotBias model.

RecoDataset is a PyTorch Dataset subclass that stores user, item, and rating arrays as tensors. Each element it returns is a (user_item_pair, rating) tuple, the input format used by embedding-based collaborative filtering.

RecoDataLoader is a utility class that holds the training and validation DataLoaders together with user/item metadata. Its from_df class method is the primary entry point. Given a pandas DataFrame, it (1) creates string-sorted categorical mappings with a #na# placeholder at index 0, (2) converts raw user/item IDs to contiguous integer indices that can be used for embedding lookups, (3) performs a random train/validation split with reproducible seeding, and (4) wraps the resulting datasets in PyTorch DataLoaders.

The class also stores the user2idx and item2idx mapping dictionaries and a classes dictionary for index-to-ID lookups. A show_batch method is provided for quick inspection of training batches.
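The index encoding described above can be illustrated with a minimal, standard-library sketch. It is not the library's code: the helper name build_index and its na_token parameter are hypothetical, and keying the mapping by the string form of each ID is an assumption made here to show what "string-sorted" implies.

```python
def build_index(values, na_token="#na#"):
    """Hypothetical helper: map raw IDs to contiguous indices, reserving 0."""
    # Sort the unique IDs by their string form, then put the placeholder first.
    classes = [na_token] + sorted({str(v) for v in values})
    # Position in the sorted class list becomes the embedding index.
    to_idx = {c: i for i, c in enumerate(classes)}
    return classes, to_idx


user_ids = [1, 1, 2, 2, 3, 10]
classes, user2idx = build_index(user_ids)

# Forward lookup: raw ID -> contiguous index (index 0 is never assigned to a real ID).
encoded = [user2idx[str(u)] for u in user_ids]

# Reverse lookup: index -> ID string, via the class list.
decoded = [classes[i] for i in encoded]

print(classes)   # ['#na#', '1', '10', '2', '3']
print(encoded)   # [1, 1, 3, 3, 4, 2]
```

Note that string sorting places "10" before "2", so the resulting indices do not follow numeric order. The number of embedding rows needed in this sketch is len(classes), which includes the placeholder.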

Usage

Use this module when preparing data for the EmbeddingDotBias collaborative filtering model. It is the standard way to convert a pandas DataFrame of user-item-rating interactions into PyTorch DataLoaders with proper categorical index encoding. Use RecoDataLoader.from_df to create the full data pipeline from a DataFrame, or instantiate RecoDataset directly if you need custom DataLoader configurations.

Code Reference

Source Location

Signature

class RecoDataset(Dataset):
    def __init__(self, users, items, ratings)
    def __len__(self)
    def __getitem__(self, idx)

class RecoDataLoader:
    def __init__(self, train_dl, valid_dl=None)

    @classmethod
    def from_df(
        cls,
        ratings,
        valid_pct=0.2,
        user_name=None,
        item_name=None,
        rating_name=None,
        seed=42,
        batch_size=64,
        **kwargs,
    )

    def show_batch(self, n=5)

Import

from recommenders.models.embdotbias.data_loader import RecoDataset, RecoDataLoader

I/O Contract

Inputs

RecoDataset.__init__

Name     Type        Required  Description
users    array-like  Yes       User IDs or indices
items    array-like  Yes       Item IDs or indices
ratings  array-like  Yes       Ratings or interaction values

RecoDataLoader.from_df

Name         Type          Required  Description
ratings      pd.DataFrame  Yes       DataFrame containing user, item, and rating columns
valid_pct    float         No        Fraction of data for validation (default 0.2)
user_name    str           No        Name of the user column (defaults to first column)
item_name    str           No        Name of the item column (defaults to second column)
rating_name  str           No        Name of the rating column (defaults to third column)
seed         int           No        Random seed for reproducibility (default 42)
batch_size   int           No        Batch size for DataLoaders (default 64)
**kwargs     dict          No        Additional DataLoader arguments
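
The roles of valid_pct and seed can be illustrated with a small sketch of a seeded random split. This is an assumption-laden illustration, not the library's implementation: the function split_indices is hypothetical, it uses Python's random module rather than whatever random number generator from_df uses internally, and the rounding of the validation size may differ.

```python
import random


def split_indices(n_rows, valid_pct=0.2, seed=42):
    """Hypothetical helper: shuffle row positions and cut off a validation slice."""
    rng = random.Random(seed)          # private RNG, so the global state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)               # same seed -> same permutation
    n_valid = int(n_rows * valid_pct)  # validation size, rounded down
    return indices[n_valid:], indices[:n_valid]


train_idx, valid_idx = split_indices(10, valid_pct=0.2, seed=42)

# The two index sets are disjoint and together cover every row exactly once.
print(len(train_idx), len(valid_idx))  # 8 2

# Calling again with the same seed reproduces the same split.
assert split_indices(10, valid_pct=0.2, seed=42) == (train_idx, valid_idx)
```

Fixing the seed makes the split repeatable across runs, which keeps validation metrics comparable between experiments.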

RecoDataLoader.show_batch

Name  Type  Required  Description
n     int   No        Number of examples to show from the batch (default 5)

Outputs

RecoDataset.__getitem__

Name    Type                   Description
return  tuple(Tensor, Tensor)  A tuple of (user_item_tensor of shape [2], rating_tensor of shape [1])

RecoDataLoader.from_df

Name    Type            Description
return  RecoDataLoader  Instance with train/valid DataLoaders and metadata (classes, n_users, n_items, user2idx, item2idx)

RecoDataLoader.show_batch

Name    Type  Description
return  None  Prints a sample of training batch data to stdout

Usage Examples

Basic Usage

import pandas as pd
from recommenders.models.embdotbias.data_loader import RecoDataLoader

# Prepare a DataFrame with user, item, and rating columns
df = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Create DataLoaders from the DataFrame
data = RecoDataLoader.from_df(
    df,
    valid_pct=0.2,
    user_name="userID",
    item_name="itemID",
    rating_name="rating",
    seed=42,
    batch_size=32,
)

# Access metadata
print(f"Number of users: {data.n_users}")
print(f"Number of items: {data.n_items}")
print(f"User classes: {data.classes['userID']}")

# Inspect a training batch
data.show_batch(n=3)

# Iterate over training DataLoader
for user_item_batch, ratings_batch in data.train:
    users = user_item_batch[:, 0]
    items = user_item_batch[:, 1]
    # Feed to model...
    break
