Overview
Provides a PyTorch Dataset class and a DataLoader wrapper class for loading, indexing, and batching user-item-rating data for the EmbeddingDotBias collaborative filtering model.
Description
This module contains two classes that form the data ingestion pipeline for the EmbeddingDotBias model.
RecoDataset is a PyTorch Dataset subclass that stores user, item, and rating arrays as tensors and returns (user_item_pair, rating) tuples suitable for embedding-based collaborative filtering.
RecoDataLoader is a utility class that manages the training and validation DataLoaders together with user/item metadata. Its from_df class method is the primary entry point. Given a pandas DataFrame, it creates string-sorted categorical mappings with a #na# placeholder at index 0, converts raw user/item IDs to contiguous integer indices suitable for embedding lookups, performs a random train/validation split with reproducible seeding, and wraps the resulting datasets in PyTorch DataLoaders. The class also stores the user2idx and item2idx mapping dictionaries and a classes dictionary for index-to-ID lookups. A show_batch method is provided for quick inspection of training batches.
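The index encoding and seeded split described here can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's actual code: build_index is a hypothetical helper, keying the mapping on the string form of each ID is an assumption, and the real split may draw its permutation from a different random generator.

```python
import random


def build_index(raw_ids):
    """Hypothetical helper: return (classes, id2idx) with '#na#' at index 0."""
    # Sort the unique IDs by their string form, as the description states.
    classes = ["#na#"] + sorted({str(v) for v in raw_ids})
    id2idx = {c: i for i, c in enumerate(classes)}
    return classes, id2idx


user_ids = [1, 1, 2, 2, 10]
classes, user2idx = build_index(user_ids)

# Assumption: the mapping is keyed on the string form of each raw ID.
encoded = [user2idx[str(u)] for u in user_ids]

# Reproducible split: shuffle row positions with a seeded generator,
# then hold out the first valid_pct share of them for validation.
valid_pct, seed = 0.2, 42
positions = list(range(len(user_ids)))
random.Random(seed).shuffle(positions)
n_valid = int(len(positions) * valid_pct)
valid_idx, train_idx = positions[:n_valid], positions[n_valid:]

print(classes)  # ['#na#', '1', '10', '2']
print(encoded)  # [1, 1, 3, 3, 2]
```

Note that string sorting places "10" before "2", so index order does not follow numeric order. Index 0 is never assigned to a real ID, and classes[idx] recovers the ID for any encoded index.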
Usage
Use this module when preparing data for the EmbeddingDotBias collaborative filtering model. It is the standard way to convert a pandas DataFrame of user-item-rating interactions into PyTorch DataLoaders with proper categorical index encoding. Use RecoDataLoader.from_df to create the full data pipeline from a DataFrame, or instantiate RecoDataset directly if you need custom DataLoader configurations.
Code Reference
Source Location
Signature
class RecoDataset(Dataset):
    def __init__(self, users, items, ratings)
    def __len__(self)
    def __getitem__(self, idx)

class RecoDataLoader:
    def __init__(self, train_dl, valid_dl=None)

    @classmethod
    def from_df(
        cls,
        ratings,
        valid_pct=0.2,
        user_name=None,
        item_name=None,
        rating_name=None,
        seed=42,
        batch_size=64,
        **kwargs,
    )

    def show_batch(self, n=5)
Import
from recommenders.models.embdotbias.data_loader import RecoDataset, RecoDataLoader
I/O Contract
Inputs
RecoDataset.__init__
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| users | array-like | Yes | User IDs or indices |
| items | array-like | Yes | Item IDs or indices |
| ratings | array-like | Yes | Ratings or interaction values |
RecoDataLoader.from_df
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| ratings | pd.DataFrame | Yes | DataFrame containing user, item, and rating columns |
| valid_pct | float | No | Fraction of data for validation (default 0.2) |
| user_name | str | No | Name of the user column (defaults to first column) |
| item_name | str | No | Name of the item column (defaults to second column) |
| rating_name | str | No | Name of the rating column (defaults to third column) |
| seed | int | No | Random seed for reproducibility (default 42) |
| batch_size | int | No | Batch size for DataLoaders (default 64) |
| **kwargs | dict | No | Additional DataLoader arguments |
RecoDataLoader.show_batch
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| n | int | No | Number of examples to show from the batch (default 5) |
Outputs
RecoDataset.__getitem__
| Name | Type | Description |
| --- | --- | --- |
| return | tuple(Tensor, Tensor) | A tuple of (user_item_tensor of shape [2], rating_tensor of shape [1]) |
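To make the shape contract concrete, here is a torch-free sketch in which plain lists stand in for tensors. ToyRecoDataset and collate are hypothetical names, not part of the library. The sketch assumes that batching stacks samples along a new leading dimension, as PyTorch's default collation does, so a batch of B samples has shapes [B, 2] and [B, 1].

```python
class ToyRecoDataset:
    """Hypothetical stand-in for RecoDataset; lists play the role of tensors."""

    def __init__(self, users, items, ratings):
        self.users = list(users)
        self.items = list(items)
        self.ratings = [float(r) for r in ratings]

    def __len__(self):
        return len(self.ratings)

    def __getitem__(self, idx):
        # A (user, item) pair of "shape [2]" and a rating of "shape [1]".
        return [self.users[idx], self.items[idx]], [self.ratings[idx]]


def collate(samples):
    """Hypothetical collation: stack samples along a new leading batch axis."""
    pairs = [pair for pair, _ in samples]
    ratings = [rating for _, rating in samples]
    return pairs, ratings


ds = ToyRecoDataset(users=[1, 1, 2], items=[1, 2, 1], ratings=[4.0, 3.5, 5.0])
pair, rating = ds[0]

# Batch of 3 samples: pairs have "shape [3, 2]", ratings "shape [3, 1]".
batch_pairs, batch_ratings = collate([ds[i] for i in range(len(ds))])

# Column 0 holds user indices and column 1 holds item indices.
batch_users = [p[0] for p in batch_pairs]
batch_items = [p[1] for p in batch_pairs]
```

With real tensors, the two column extractions correspond to slicing user_item_batch[:, 0] and user_item_batch[:, 1].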
RecoDataLoader.from_df
| Name | Type | Description |
| --- | --- | --- |
| return | RecoDataLoader | Instance with train/valid DataLoaders and metadata (classes, n_users, n_items, user2idx, item2idx) |
RecoDataLoader.show_batch
| Name | Type | Description |
| --- | --- | --- |
| return | None | Prints a sample of training batch data to stdout |
Usage Examples
Basic Usage
import pandas as pd

from recommenders.models.embdotbias.data_loader import RecoDataLoader

# Prepare a DataFrame with user, item, and rating columns
df = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 20, 10, 30, 20],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5],
})

# Create DataLoaders from the DataFrame
data = RecoDataLoader.from_df(
    df,
    valid_pct=0.2,
    user_name="userID",
    item_name="itemID",
    rating_name="rating",
    seed=42,
    batch_size=32,
)

# Access metadata
print(f"Number of users: {data.n_users}")
print(f"Number of items: {data.n_items}")
print(f"User classes: {data.classes['userID']}")

# Inspect a training batch
data.show_batch(n=3)

# Iterate over training DataLoader
for user_item_batch, ratings_batch in data.train:
    users = user_item_batch[:, 0]
    items = user_item_batch[:, 1]
    # Feed to model...
    break