Implementation:Recommenders team Recommenders Amazon Reviews
| Knowledge Sources | |
|---|---|
| Domains | Data Preprocessing, Sequential Recommendation, Dataset Management |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
The Amazon Reviews module provides end-to-end utilities for downloading, extracting, and preprocessing Amazon product review datasets into the format required by recommendation models such as SLi-Rec (sequential) and DKN (knowledge-aware).
Description
This module handles the complete data pipeline from raw Amazon review data hosted on Stanford SNAP to model-ready training, validation, and test files. The pipeline comprises several stages:
Download and extraction is managed by download_and_extract, which fetches gzipped review and metadata files from the Stanford SNAP repository and extracts them to a local directory. The helper get_review_data is a convenience wrapper that downloads and preprocesses review data in one call.
Data preprocessing is orchestrated by the data_preprocessing function, which chains together multiple internal processing steps. First, _reviews_preprocessing extracts reviewer IDs, ASINs, and Unix timestamps from raw JSON review lines. Then _meta_preprocessing extracts item-to-category mappings from product metadata. The _create_instance function merges reviews with metadata to produce labeled instances sorted by timestamp within each user.
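The extraction and merge steps described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names parse_review_line and build_instances and the "unknown" fallback category are hypothetical, and the JSON field names (reviewerID, asin, unixReviewTime) are assumed to follow the SNAP review schema.

```python
import json


def parse_review_line(line):
    """Extract (reviewer_id, asin, unix_time) from one JSON review record."""
    record = json.loads(line)
    return record["reviewerID"], record["asin"], int(record["unixReviewTime"])


def build_instances(review_lines, item_to_category, default_category="unknown"):
    """Join reviews with item categories, then sort by user and timestamp."""
    rows = []
    for line in review_lines:
        user, item, timestamp = parse_review_line(line)
        # Hypothetical fallback for items that have no metadata entry.
        category = item_to_category.get(item, default_category)
        rows.append((user, item, category, timestamp))
    # Sorting by (user, timestamp) orders each user's history chronologically.
    rows.sort(key=lambda row: (row[0], row[3]))
    return rows


raw_lines = [
    '{"reviewerID": "A1", "asin": "B2", "unixReviewTime": 20}',
    '{"reviewerID": "A1", "asin": "B1", "unixReviewTime": 10}',
    '{"reviewerID": "A0", "asin": "B1", "unixReviewTime": 30}',
]
instances = build_instances(raw_lines, {"B1": "Cables", "B2": "Audio"})
for row in instances:
    print("\t".join(str(field) for field in row))
```

The real internal functions read from and write to intermediate files rather than in-memory lists; the sketch only shows the shape of the transformation.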
Sampling and splitting is performed through _get_sampled_data, which randomly samples a subset of items at a configurable rate to reduce dataset size, and _data_processing, which assigns each interaction to train, validation, or test sets based on temporal ordering within each user's history (all but the last two interactions go to train, the second-to-last to validation, and the last to test).
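As a rough sketch of the two steps above, the snippet below samples a fraction of unique items and then tags one user's time-ordered history with split names. The helpers sample_items and assign_splits and the seed parameter are hypothetical names chosen for illustration; the module's internal functions operate on files and may differ in detail.

```python
import random


def sample_items(item_ids, sample_rate, seed=0):
    """Randomly keep roughly `sample_rate` of the unique item IDs."""
    unique_items = sorted(set(item_ids))
    keep_count = int(len(unique_items) * sample_rate)
    return set(random.Random(seed).sample(unique_items, keep_count))


def assign_splits(user_events):
    """Tag each event of one user's chronologically ordered history.

    All but the last two events go to train, the second-to-last
    to valid, and the last to test.
    """
    total = len(user_events)
    tagged = []
    for index, event in enumerate(user_events):
        if index < total - 2:
            split = "train"
        elif index == total - 2:
            split = "valid"
        else:
            split = "test"
        tagged.append((split, event))
    return tagged


kept_items = sample_items([f"item{i}" for i in range(10)], sample_rate=0.5)
splits = assign_splits(["a", "b", "c", "d"])
print(len(kept_items), splits)
```

With this rule, a user who has only two interactions contributes nothing to the training split, which is one reason very short histories are often filtered out before splitting.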
History expansion is supported in two modes. The default _data_generating function unfolds each user's behavior sequence so that for a sequence of items [1, 2, 3, 4, 5], it produces training instances with histories [1], [1,2], [1,2,3], etc. The alternative _data_generating_no_history_expanding writes only the full sequence as a single instance.
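The difference between the two modes can be illustrated with a small sketch. The function names expand_history and single_instance are hypothetical, and the choice of the last item as the target in the non-expanding mode is an assumption made for this example rather than a statement about the module's exact output.

```python
def expand_history(sequence):
    """Unfold a sequence into (history, target) pairs, one per prefix."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


def single_instance(sequence):
    """Emit only one (history, target) pair covering the full sequence."""
    if len(sequence) < 2:
        return []
    return [(sequence[:-1], sequence[-1])]


items = [1, 2, 3, 4, 5]
expanded = expand_history(items)
collapsed = single_instance(items)
for history, target in expanded:
    print(history, "->", target)
```

Expansion multiplies the number of training instances roughly by the average sequence length, so the non-expanding mode is the cheaper option when disk space or training time is a concern.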
Vocabulary generation is handled by _create_vocab, which builds frequency-sorted user, item, and category vocabularies from the training file and serializes each one as a pickle file.
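A frequency-sorted vocabulary of this kind can be sketched with the standard library. The helper build_vocab is a hypothetical name, and the index layout (most frequent token gets index 0, ties broken alphabetically) is an assumption for illustration; the module's actual index assignment may differ.

```python
import pickle
from collections import Counter


def build_vocab(tokens):
    """Map each token to an index, most frequent token first."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically for stable ties.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return {token: index for index, (token, _) in enumerate(ranked)}


item_vocab = build_vocab(["b", "a", "b", "c", "a", "b"])

# Round-trip through pickle in memory; the module writes to a file instead.
restored = pickle.loads(pickle.dumps(item_vocab))
print(restored)
```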
Negative sampling is performed offline by _negative_sampling_offline, which appends randomly sampled negative items to the validation and test files, with configurable numbers of negatives per positive instance.
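The idea behind offline negative sampling can be sketched as below. The helper sample_negatives is hypothetical, and drawing distinct negatives without replacement is an assumption of this example; it also assumes the candidate pool holds at least num_negatives items besides the positive one.

```python
import random


def sample_negatives(positive_item, candidate_items, num_negatives, rng):
    """Draw distinct items, none equal to the positive item."""
    pool = [item for item in candidate_items if item != positive_item]
    return rng.sample(pool, num_negatives)


rng = random.Random(42)
catalog = [f"i{n}" for n in range(1, 11)]
negatives = sample_negatives("i3", catalog, num_negatives=4, rng=rng)

# One positive row (label 1) followed by its negatives (label 0).
rows = [(1, "i3")] + [(0, item) for item in negatives]
for label, item in rows:
    print(label, item)
```

With valid_num_ngs=4, each positive validation instance would be followed by four negative rows in this layout, and with test_num_ngs=9 each test instance by nine.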
Usage
Use this module when preparing Amazon product review data for training sequential or knowledge-aware recommendation models. It is intended for use with deep learning models in the Recommenders library that expect interaction data formatted as user behavior sequences with item and category histories. The typical workflow is to call data_preprocessing once to generate all required train, validation, test, and vocabulary files.
Code Reference
Source Location
- Repository: Recommenders
- File: recommenders/datasets/amazon_reviews.py
- Lines: 1-550
Signature
def get_review_data(reviews_file): ...

def data_preprocessing(
    reviews_file,
    meta_file,
    train_file,
    valid_file,
    test_file,
    user_vocab,
    item_vocab,
    cate_vocab,
    sample_rate=0.01,
    valid_num_ngs=4,
    test_num_ngs=9,
    is_history_expanding=True,
): ...

def download_and_extract(name, dest_path): ...
Import
from recommenders.datasets.amazon_reviews import (
    get_review_data,
    data_preprocessing,
    download_and_extract,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| reviews_file | str | Yes | Path to the raw Amazon reviews JSON file (or destination for download) |
| meta_file | str | Yes | Path to the raw Amazon product metadata file |
| train_file | str | Yes | Output file path for training data |
| valid_file | str | Yes | Output file path for validation data |
| test_file | str | Yes | Output file path for test data |
| user_vocab | str | Yes | Output file path for pickled user vocabulary dictionary |
| item_vocab | str | Yes | Output file path for pickled item vocabulary dictionary |
| cate_vocab | str | Yes | Output file path for pickled category vocabulary dictionary |
| sample_rate | float | No | Fraction of unique items to sample from the dataset (default: 0.01) |
| valid_num_ngs | int | No | Number of negative samples per positive instance in validation set (default: 4) |
| test_num_ngs | int | No | Number of negative samples per positive instance in test set (default: 9) |
| is_history_expanding | bool | No | Whether to unfold user behavior sequences into multiple training instances (default: True) |
| name | str | Yes | File name of the dataset to fetch with download_and_extract (for example reviews_Electronics_5.json), used to construct the download URL |
| dest_path | str | Yes | Destination file path for download_and_extract |
Outputs
| Name | Type | Description |
|---|---|---|
| get_review_data return | str | File path to the preprocessed reviews output file |
| download_and_extract return | str | File path to the extracted data file |
| data_preprocessing side effects | files | Generates train, valid, and test files with tab-separated fields (label, user_id, item_id, category, timestamp, item_history, category_history, timestamp_history), plus pickled vocabulary files and negative-sampled validation/test files |
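To make the output layout concrete, the following sketch parses one tab-separated line into the eight fields listed above. The Instance type and parse_instance helper are hypothetical, and the comma delimiter inside the history fields is an assumption for this example; check a generated file before relying on it.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    label: int
    user_id: str
    item_id: str
    category: str
    timestamp: int
    item_history: List[str]
    category_history: List[str]
    timestamp_history: List[int]


def parse_instance(line, list_sep=","):
    """Split one output line into typed fields (list_sep is assumed)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated fields, got {len(fields)}")
    label, user_id, item_id, category, timestamp, items, cats, times = fields
    return Instance(
        label=int(label),
        user_id=user_id,
        item_id=item_id,
        category=category,
        timestamp=int(timestamp),
        item_history=items.split(list_sep),
        category_history=cats.split(list_sep),
        timestamp_history=[int(t) for t in times.split(list_sep)],
    )


sample_line = "1\tA1\tB3\tAudio\t30\tB1,B2\tCables,Audio\t10,20\n"
instance = parse_instance(sample_line)
print(instance)
```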
Usage Examples
Basic Data Preprocessing
from recommenders.datasets.amazon_reviews import data_preprocessing

# Assumes the raw reviews and metadata files already exist at these paths
data_dir = "/tmp/amazon_data"
reviews_file = f"{data_dir}/reviews_Electronics_5.json"
meta_file = f"{data_dir}/meta_Electronics.json"

# Writes the train/valid/test files and the three pickled vocabularies
data_preprocessing(
    reviews_file=reviews_file,
    meta_file=meta_file,
    train_file=f"{data_dir}/train.txt",
    valid_file=f"{data_dir}/valid.txt",
    test_file=f"{data_dir}/test.txt",
    user_vocab=f"{data_dir}/user_vocab.pkl",
    item_vocab=f"{data_dir}/item_vocab.pkl",
    cate_vocab=f"{data_dir}/cate_vocab.pkl",
    sample_rate=0.01,
    valid_num_ngs=4,
    test_num_ngs=9,
    is_history_expanding=True,
)
Download and Extract Only
from recommenders.datasets.amazon_reviews import download_and_extract

# Download Amazon Electronics reviews from Stanford SNAP
file_path = download_and_extract(
    name="reviews_Electronics_5.json",
    dest_path="/tmp/amazon_data/reviews_Electronics_5.json",
)
print(f"Extracted file at: {file_path}")
Quick Review Data Retrieval
from recommenders.datasets.amazon_reviews import get_review_data

# Download and preprocess reviews in one step
reviews_output = get_review_data(
    reviews_file="/tmp/amazon_data/reviews_Electronics_5.json"
)
# reviews_output is the path to the preprocessed reviews file