Implementation:Recommenders team Recommenders NCF Dataset Init
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Implicit Feedback, Data Preparation |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete tool, provided by the recommenders library, for preparing implicit-feedback data with negative sampling for Neural Collaborative Filtering (NCF).
Description
Dataset.__init__ initializes the NCF dataset by loading training (and optionally test) interaction files, building user/item ID mappings, and configuring negative sampling parameters. When a test file is provided, it automatically generates a full test file that includes n_neg_test negative samples per positive test interaction, enabling the leave-one-out evaluation protocol. The class also provides a train_loader method that yields batches of (user, item, label) tuples with n_neg negatives sampled per positive example during training.
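The per-positive negative sampling described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the function name sample_training_rows and the in-memory list of (user, item) pairs are assumptions made for the example, and capping the draw at the number of available candidates is a simplification.

```python
import random


def sample_training_rows(interactions, n_neg=4, seed=None):
    """Return (user, item, label) rows: each positive plus up to n_neg negatives.

    Hypothetical helper for illustration; not part of the recommenders API.
    """
    rng = random.Random(seed)
    all_items = sorted({item for _, item in interactions})

    # Items each user has interacted with; these can never be negatives.
    seen = {}
    for user, item in interactions:
        seen.setdefault(user, set()).add(item)

    rows = []
    for user, item in interactions:
        rows.append((user, item, 1))
        candidates = [i for i in all_items if i not in seen[user]]
        # Sampling without replacement: at most len(candidates) negatives.
        k = min(n_neg, len(candidates))
        for neg in rng.sample(candidates, k):
            rows.append((user, neg, 0))
    return rows


interactions = [("u1", "a"), ("u1", "b"), ("u2", "c")]
rows = sample_training_rows(interactions, n_neg=1, seed=42)
for row in rows:
    print(row)
```

Each positive row carries label 1 and is followed by its sampled negatives with label 0, so the ratio of negatives to positives is governed by n_neg.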
When binary is True, any rating greater than 0 is set to 1, which is the standard treatment for implicit feedback. User and item IDs are mapped to contiguous integer indices through the user2id and item2id dictionaries, which the NCF model's embedding layers later consume.
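A minimal sketch of the two transformations in this paragraph, binarizing ratings and mapping raw IDs to contiguous indices, might look like the following. The helper names build_id_maps and binarize are hypothetical, and assigning indices in first-seen order is an assumption of the sketch; the library may order them differently.

```python
def build_id_maps(values):
    """Map each distinct value to a contiguous index, in first-seen order.

    Hypothetical helper; the ordering is an assumption of this sketch.
    """
    value2id = {}
    for v in values:
        if v not in value2id:
            value2id[v] = len(value2id)
    id2value = {idx: v for v, idx in value2id.items()}
    return value2id, id2value


def binarize(rating):
    """Return 1.0 for a rating above zero and 0.0 otherwise."""
    return float(rating > 0)


# (user, item, rating) rows with non-contiguous raw IDs
rows = [(10, "x", 5.0), (10, "y", 3.0), (42, "x", 0.0)]

user2id, id2user = build_id_maps(u for u, _, _ in rows)
item2id, id2item = build_id_maps(i for _, i, _ in rows)

encoded = [(user2id[u], item2id[i], binarize(r)) for u, i, r in rows]
print(encoded)
```

Contiguous indices matter because an embedding table is addressed by row number: with n_users rows, every user index must fall in the range 0 to n_users - 1.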
Usage
Import and instantiate Dataset after splitting your interaction data into training and test CSV files. This is the required data preparation step before calling NCF.fit(). The training file is mandatory; the test file is needed only for evaluation. Use n_neg to set how many negatives are sampled per positive example during training, and n_neg_test to set how many are sampled per positive example for evaluation.
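As a rough illustration of the input files, the sketch below writes two small CSV files using the default column names listed on this page (userID, itemID, rating). The helper write_interactions and the sample rows are made up for the example, and sorting rows by user before writing is a precaution assumed here rather than a requirement stated on this page.

```python
import csv
import os
import tempfile


def write_interactions(path, rows):
    """Write (userID, itemID, rating) rows to a CSV file with a header row.

    Hypothetical helper; rows are sorted by user as a precaution (assumption).
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["userID", "itemID", "rating"])
        writer.writerows(sorted(rows, key=lambda r: r[0]))


out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")

# Tiny made-up interaction sets, for illustration only
write_interactions(train_path, [(2, 11, 4.0), (1, 10, 5.0), (1, 12, 3.0)])
write_interactions(test_path, [(1, 11, 4.0), (2, 10, 2.0)])

print(train_path, test_path)
```

The resulting paths are what would be passed as train_file and test_file when constructing Dataset.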
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/models/ncf/dataset.py
- Lines: 304-391
Signature
class Dataset(object):
    def __init__(
        self,
        train_file,
        test_file=None,
        test_file_full=None,
        overwrite_test_file_full=False,
        n_neg=4,
        n_neg_test=100,
        col_user=DEFAULT_USER_COL,
        col_item=DEFAULT_ITEM_COL,
        col_rating=DEFAULT_RATING_COL,
        binary=True,
        seed=None,
        sample_with_replacement=False,
        print_warnings=False,
    ):
Import
from recommenders.models.ncf.dataset import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| train_file | str | Yes | Path to the training dataset CSV file containing user-item interactions |
| test_file | str | No | Path to the test dataset CSV file for leave-one-out evaluation. Defaults to None |
| test_file_full | str | No | Path to the full test file including negative samples. If None and test_file is provided, the path is derived from test_file by replacing its extension with _full.csv (for example, test.csv becomes test_full.csv) |
| overwrite_test_file_full | bool | No | If True, regenerate and overwrite the full test file even if it already exists. Defaults to False |
| n_neg | int | No | Number of negative samples per positive example during training. Defaults to 4 |
| n_neg_test | int | No | Number of negative samples per positive example for evaluation. Defaults to 100 |
| col_user | str | No | Name of the user ID column. Defaults to DEFAULT_USER_COL ("userID") |
| col_item | str | No | Name of the item ID column. Defaults to DEFAULT_ITEM_COL ("itemID") |
| col_rating | str | No | Name of the rating column. Defaults to DEFAULT_RATING_COL ("rating") |
| binary | bool | No | If True, set every rating greater than 0 to 1 (implicit feedback). Defaults to True |
| seed | int | No | Random seed for reproducible negative sampling. Defaults to None |
| sample_with_replacement | bool | No | If True, sample negatives with replacement. Defaults to False |
| print_warnings | bool | No | If True, print warnings when insufficient items exist for sampling without replacement. Defaults to False |
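To make the roles of n_neg_test and sample_with_replacement concrete, here is a small sketch of how a full test set could be assembled: each positive test interaction is followed by sampled negatives. The function names, the in-memory inputs, and the choice to cap the draw when too few candidates exist are assumptions of this sketch, not the library's documented behavior.

```python
import random


def sample_negatives(candidates, n, with_replacement, rng):
    """Draw n negatives; without replacement the draw is capped at len(candidates)."""
    if not candidates:
        return []
    if with_replacement:
        return [rng.choice(candidates) for _ in range(n)]
    return rng.sample(candidates, min(n, len(candidates)))


def build_full_test(test_pairs, seen_items, all_items, n_neg_test,
                    with_replacement=False, seed=None):
    """Return (user, item, label) rows: each test positive plus its negatives.

    Hypothetical helper for illustration; not part of the recommenders API.
    """
    rng = random.Random(seed)
    full = []
    for user, pos_item in test_pairs:
        # Exclude items seen in training as well as the held-out positive.
        excluded = seen_items.get(user, set()) | {pos_item}
        candidates = [i for i in all_items if i not in excluded]
        full.append((user, pos_item, 1))
        for neg in sample_negatives(candidates, n_neg_test, with_replacement, rng):
            full.append((user, neg, 0))
    return full


all_items = list(range(6))
seen_items = {"u1": {0, 1}}
test_pairs = [("u1", 2)]

without = build_full_test(test_pairs, seen_items, all_items, n_neg_test=5, seed=0)
with_rep = build_full_test(test_pairs, seen_items, all_items, n_neg_test=5,
                           with_replacement=True, seed=0)
print(len(without), len(with_rep))
```

With only three unseen items available, the draw without replacement yields three negatives instead of the requested five, which is the kind of shortfall print_warnings is described as reporting; sampling with replacement reaches five by allowing repeats.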
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Dataset | An initialized Dataset object with the following key attributes |
| dataset.n_users | int | Total number of unique users in the training data |
| dataset.n_items | int | Total number of unique items in the training data |
| dataset.user2id | dict | Mapping from original user IDs to contiguous integer indices |
| dataset.item2id | dict | Mapping from original item IDs to contiguous integer indices |
| dataset.id2user | dict | Reverse mapping from integer indices to original user IDs |
| dataset.id2item | dict | Reverse mapping from integer indices to original item IDs |
| dataset.train_len | int | Number of interactions in the training file |
Usage Examples
Basic Usage
from recommenders.models.ncf.dataset import Dataset

# Initialize dataset with training and test files
data = Dataset(
    train_file="train.csv",
    test_file="test.csv",
    n_neg=4,
    n_neg_test=100,
    binary=True,
    seed=42,
)

print(f"Users: {data.n_users}, Items: {data.n_items}")
print(f"Training interactions: {data.train_len}")
# The dataset is now ready to be passed to NCF.fit()
With MovieLens Data
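from recommenders.datasets import movielens
from recommenders.datasets.python_splitters import python_chrono_split
from recommenders.models.ncf.dataset import Dataset

# Load and split data chronologically per user
df = movielens.load_pandas_df(size="100k")
train, test = python_chrono_split(df, ratio=0.75)

# Keep only test users and items that also appear in training, so every
# test ID is covered by the mappings built from the training file
test = test[test["userID"].isin(train["userID"].unique())]
test = test[test["itemID"].isin(train["itemID"].unique())]

# Save to CSV files for Dataset
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)

# Create NCF dataset with negative sampling
data = Dataset(
    train_file="train.csv",
    test_file="test.csv",
    n_neg=4,
    n_neg_test=100,
    binary=True,
    seed=42,
)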