Implementation:Recommenders team Recommenders Amazon Reviews

Knowledge Sources	Recommenders
Domains	Data Preprocessing, Sequential Recommendation, Dataset Management
Last Updated	2026-02-10 00:00 GMT

Overview

The Amazon Reviews module provides end-to-end utilities for downloading, extracting, and preprocessing Amazon product review datasets into the format required by sequential recommendation models such as SLi-Rec.

Description

This module handles the complete data pipeline from raw Amazon review data hosted on Stanford SNAP to model-ready training, validation, and test files. The pipeline comprises several stages:

Download and extraction are handled by download_and_extract, which fetches gzipped review and metadata files from the Stanford SNAP repository and extracts them to a local directory. The helper get_review_data is a convenience wrapper that downloads a reviews file and runs the review preprocessing step on it in one call, returning the path to the preprocessed output.
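As a rough illustration of the extraction step, the sketch below decompresses a gzipped file into a plain file using only the standard library. The helper name gunzip_to and the demo paths are hypothetical, and skipping work when the target already exists is a choice made for this sketch; the module's own download helper and URL layout are not reproduced here.

```python
import gzip
import os
import shutil
import tempfile


def gunzip_to(gz_path, dest_path):
    """Decompress gz_path into dest_path (hypothetical helper for illustration)."""
    if os.path.exists(dest_path):
        return dest_path  # nothing to do if the file was already extracted
    with gzip.open(gz_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # stream in chunks instead of reading it all
    return dest_path


# Self-contained demo on a tiny synthetic archive (no network access needed).
demo_dir = tempfile.mkdtemp()
gz_file = os.path.join(demo_dir, "sample.json.gz")
with gzip.open(gz_file, "wb") as f:
    f.write(b'{"asin": "B000"}\n')

extracted = gunzip_to(gz_file, os.path.join(demo_dir, "sample.json"))
with open(extracted, "rb") as f:
    content = f.read()
```

Streaming with shutil.copyfileobj keeps memory use flat, which matters for review dumps that can be several gigabytes once decompressed.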

Data preprocessing is orchestrated by the data_preprocessing function, which chains together multiple internal processing steps. First, _reviews_preprocessing extracts reviewer IDs, ASINs, and Unix timestamps from raw JSON review lines. Then _meta_preprocessing extracts item-to-category mappings from product metadata. The _create_instance function merges reviews with metadata to produce labeled instances sorted by timestamp within each user.
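The extract-and-merge idea can be sketched on a few synthetic records. This is a minimal illustration, not the module's internal code: the field names (reviewerID, asin, unixReviewTime, categories) are assumed from the public Amazon review dumps, the sample lines are strict JSON by construction, and keeping the last category label plus the "unknown" fallback are assumptions of the sketch.

```python
import json

# Hypothetical sample records; real files hold one record per line.
review_lines = [
    '{"reviewerID": "U1", "asin": "A1", "unixReviewTime": 200}',
    '{"reviewerID": "U1", "asin": "A2", "unixReviewTime": 100}',
    '{"reviewerID": "U2", "asin": "A1", "unixReviewTime": 150}',
]
meta_lines = [
    '{"asin": "A1", "categories": [["Electronics", "Audio"]]}',
    '{"asin": "A2", "categories": [["Electronics", "Cables"]]}',
]

# Item -> category map; this sketch keeps the most specific (last) label.
item2cate = {}
for line in meta_lines:
    meta = json.loads(line)
    item2cate[meta["asin"]] = meta["categories"][0][-1]

# One positive instance per review: (label, user, item, timestamp, category).
instances = []
for line in review_lines:
    review = json.loads(line)
    category = item2cate.get(review["asin"], "unknown")  # fallback is illustrative
    instances.append(
        (1, review["reviewerID"], review["asin"], review["unixReviewTime"], category)
    )

# Group by user, then order each user's interactions by timestamp.
instances.sort(key=lambda row: (row[1], row[3]))
```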

Sampling and splitting are performed by two functions. _get_sampled_data randomly samples a subset of items at a configurable rate to reduce dataset size. _data_processing then assigns each interaction to the train, validation, or test set based on its temporal position within the user's history: all but the last two interactions go to train, the second-to-last goes to validation, and the last goes to test.
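The temporal split rule described above can be sketched as follows. The function name assign_splits and the sample interactions are hypothetical; the sketch assumes rows are already ordered by time within each user.

```python
from collections import Counter

# (user, item) interactions, already sorted by time within each user.
interactions = [
    ("U1", "A1"), ("U1", "A2"), ("U1", "A3"), ("U1", "A4"),
    ("U2", "B1"), ("U2", "B2"),
]


def assign_splits(rows):
    """Tag each row: all but the last two -> train, then valid, then test."""
    counts = Counter(user for user, _ in rows)  # interactions per user
    seen = Counter()                            # position within the user's history
    tagged = []
    for user, item in rows:
        position = seen[user]
        if position < counts[user] - 2:
            tag = "train"
        elif position < counts[user] - 1:
            tag = "valid"
        else:
            tag = "test"
        tagged.append((tag, user, item))
        seen[user] += 1
    return tagged


tagged = assign_splits(interactions)
```

Under this rule a user with only two interactions contributes one validation row and one test row but no training rows.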

History expansion is supported in two modes. The default _data_generating function unfolds each user's behavior sequence: for a sequence of items [1, 2, 3, 4, 5], it produces instances whose histories are [1], [1, 2], [1, 2, 3], and so on, each paired with the item that follows the history as its target. The alternative _data_generating_no_history_expanding writes only the full sequence as a single instance.
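The difference between the two modes can be sketched with two small helpers. Both function names are hypothetical, the sketch ignores the train/validation/test tagging, and treating the last item as the target in the non-expanding mode is an assumption made for illustration.

```python
def expand_history(items):
    """Unfold a sequence into (history, target) pairs with growing histories."""
    return [(items[:i], items[i]) for i in range(1, len(items))]


def full_sequence_only(items):
    """Emit a single (history, target) pair covering the whole sequence."""
    if len(items) < 2:
        return []  # no history available for a single interaction
    return [(items[:-1], items[-1])]


sequence = [1, 2, 3, 4, 5]
expanded = expand_history(sequence)
single = full_sequence_only(sequence)
```

Expansion multiplies the number of instances per user roughly by the sequence length, so the non-expanding mode is the cheaper option when file size or training time is a concern.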

Vocabulary generation is handled by _create_vocab, which builds frequency-sorted user, item, and category vocabularies from the training file and serializes each one to its own pickle file via cPickle.
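A frequency-sorted vocabulary can be sketched with collections.Counter and the standard pickle module. The helper build_vocab, the reserved "default" token at index 0, and the sample rows are assumptions of this sketch rather than a description of the module's internals.

```python
import os
import pickle
import tempfile
from collections import Counter

# Hypothetical (user, item, category) rows taken from a training file.
train_rows = [
    ("U1", "A1", "Audio"),
    ("U1", "A2", "Cables"),
    ("U2", "A1", "Audio"),
]


def build_vocab(values, default_token="default"):
    """Map tokens to integer ids, most frequent first; id 0 is a fallback."""
    counts = Counter(values)
    vocab = {default_token: 0}
    for index, (token, _) in enumerate(counts.most_common(), start=1):
        vocab[token] = index
    return vocab


item_vocab = build_vocab(item for _, item, _ in train_rows)

# Round-trip the vocabulary through a pickle file.
vocab_path = os.path.join(tempfile.mkdtemp(), "item_vocab.pkl")
with open(vocab_path, "wb") as f:
    pickle.dump(item_vocab, f)
with open(vocab_path, "rb") as f:
    restored = pickle.load(f)
```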

Negative sampling is performed offline by _negative_sampling_offline, which appends randomly sampled negative items to the validation and test files, with configurable numbers of negatives per positive instance.
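Offline negative sampling can be sketched as below: each positive row is followed by a fixed number of label-0 rows for the same user. The function add_negatives, the uniform sampling over distinct items, and the seed parameter are assumptions of this sketch; the module's actual sampling distribution may differ.

```python
import random


def add_negatives(positives, all_items, num_negatives, seed=0):
    """Follow each positive (user, item) with num_negatives sampled negatives."""
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    candidates = sorted(set(all_items))
    rows = []
    for user, pos_item in positives:
        pool = [item for item in candidates if item != pos_item]
        if len(pool) < num_negatives:
            raise ValueError("not enough distinct items to sample from")
        rows.append((1, user, pos_item))           # the original positive row
        for neg_item in rng.sample(pool, num_negatives):
            rows.append((0, user, neg_item))       # sampled negative rows
    return rows


rows = add_negatives(
    positives=[("U1", "A1")],
    all_items=["A1", "A2", "A3", "A4", "A5"],
    num_negatives=4,
)
```

With four negatives per positive, each evaluation group contains five candidates, which is the layout implied by valid_num_ngs=4.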

Usage

Use this module when preparing Amazon product review data for training sequential or knowledge-aware recommendation models. It is intended for deep learning models in the Recommenders library that expect interaction data formatted as user behavior sequences with item and category histories. The typical workflow is to obtain the raw reviews and metadata files (for example with download_and_extract) and then call data_preprocessing once to generate all required train, validation, test, and vocabulary files.

Code Reference

Source Location

Repository: Recommenders
File: recommenders/datasets/amazon_reviews.py
Lines: 1-550

Signature

def get_review_data(reviews_file): ...

def data_preprocessing(
    reviews_file,
    meta_file,
    train_file,
    valid_file,
    test_file,
    user_vocab,
    item_vocab,
    cate_vocab,
    sample_rate=0.01,
    valid_num_ngs=4,
    test_num_ngs=9,
    is_history_expanding=True,
): ...

def download_and_extract(name, dest_path): ...

Import

from recommenders.datasets.amazon_reviews import (
    get_review_data,
    data_preprocessing,
    download_and_extract,
)

I/O Contract

Inputs

Name	Type	Required	Description
reviews_file	str	Yes	Path to the raw Amazon reviews JSON file (or destination for download)
meta_file	str	Yes	Path to the raw Amazon product metadata file
train_file	str	Yes	Output file path for training data
valid_file	str	Yes	Output file path for validation data
test_file	str	Yes	Output file path for test data
user_vocab	str	Yes	Output file path for pickled user vocabulary dictionary
item_vocab	str	Yes	Output file path for pickled item vocabulary dictionary
cate_vocab	str	Yes	Output file path for pickled category vocabulary dictionary
sample_rate	float	No	Fraction of unique items to sample from the dataset (default: 0.01)
valid_num_ngs	int	No	Number of negative samples per positive instance in validation set (default: 4)
test_num_ngs	int	No	Number of negative samples per positive instance in test set (default: 9)
is_history_expanding	bool	No	Whether to unfold user behavior sequences into multiple training instances (default: True)
name	str	Yes	Name of the dataset file for download_and_extract (for example reviews_Electronics_5.json), used to construct the download URL
dest_path	str	Yes	Destination file path for download_and_extract

Outputs

Name	Type	Description
get_review_data return	str	File path to the preprocessed reviews output file
download_and_extract return	str	File path to the extracted data file
data_preprocessing side effects	files	Generates train, valid, and test files with tab-separated fields (label, user_id, item_id, category, timestamp, item_history, category_history, timestamp_history), plus pickled vocabulary files and negative-sampled validation/test files
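To make the output layout concrete, the sketch below parses one tab-separated line into the eight fields listed above. The helper parse_line and the sample line are hypothetical, and the comma as the separator inside the history fields is an assumption of this sketch.

```python
FIELDS = [
    "label", "user_id", "item_id", "category", "timestamp",
    "item_history", "category_history", "timestamp_history",
]
HISTORY_FIELDS = ("item_history", "category_history", "timestamp_history")


def parse_line(line, history_sep=","):
    """Split one output line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["label"] = int(record["label"])
    for key in HISTORY_FIELDS:
        # An empty history field becomes an empty list.
        record[key] = record[key].split(history_sep) if record[key] else []
    return record


sample = "1\tU1\tA3\tAudio\t300\tA1,A2\tCables,Audio\t100,200\n"
record = parse_line(sample)
```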

Usage Examples

Basic Data Preprocessing

from recommenders.datasets.amazon_reviews import data_preprocessing

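# Assumes both raw files already exist under data_dir,
# for example after fetching them with download_and_extract.
data_dir = "/tmp/amazon_data"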
reviews_file = f"{data_dir}/reviews_Electronics_5.json"
meta_file = f"{data_dir}/meta_Electronics.json"

data_preprocessing(
    reviews_file=reviews_file,
    meta_file=meta_file,
    train_file=f"{data_dir}/train.txt",
    valid_file=f"{data_dir}/valid.txt",
    test_file=f"{data_dir}/test.txt",
    user_vocab=f"{data_dir}/user_vocab.pkl",
    item_vocab=f"{data_dir}/item_vocab.pkl",
    cate_vocab=f"{data_dir}/cate_vocab.pkl",
    sample_rate=0.01,
    valid_num_ngs=4,
    test_num_ngs=9,
    is_history_expanding=True,
)

Download and Extract Only

from recommenders.datasets.amazon_reviews import download_and_extract

# Download Amazon Electronics reviews from Stanford SNAP
file_path = download_and_extract(
    name="reviews_Electronics_5.json",
    dest_path="/tmp/amazon_data/reviews_Electronics_5.json",
)
print(f"Extracted file at: {file_path}")

Quick Review Data Retrieval

from recommenders.datasets.amazon_reviews import get_review_data

# Download and preprocess reviews in one step
reviews_output = get_review_data(
    reviews_file="/tmp/amazon_data/reviews_Electronics_5.json"
)
# reviews_output is the path to the preprocessed reviews file
