Implementation:Recommenders team Recommenders Load Pandas Df
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Data Loading, Benchmark Datasets |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A concrete tool, provided by the recommenders library, for loading MovieLens benchmark datasets into pandas DataFrames.
Description
The load_pandas_df function downloads, caches, extracts, and parses a MovieLens dataset into a pandas DataFrame. It supports five size options ("100k", "1m", "10m", "20m", and "mock100", a mock dataset intended for unit tests) and handles the format differences between them transparently. The function can optionally join movie metadata columns (title, genres, release year) onto the ratings data. When local_cache_path is set, the downloaded archive is kept there and reused on later calls instead of being downloaded again; when it is None, a temporary directory is used and cleaned up afterwards.
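The caching behavior described above can be sketched with the standard library alone. This is a minimal illustration, not the library's code: `cached_file` and `fake_fetch` are hypothetical names, the "download" is a stand-in that writes a tiny file, and only the directory form of a cache path is handled.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def cached_file(filename, fetch, cache_dir=None):
    """Yield a path to `filename`, calling `fetch(path)` only on a cache miss.

    Hypothetical helper. With cache_dir=None, a temporary directory is used
    and removed when the block exits.
    """
    if cache_dir is None:
        with tempfile.TemporaryDirectory() as tmp_dir:
            path = os.path.join(tmp_dir, filename)
            fetch(path)  # a fresh temporary directory never has a cached copy
            yield path
        return
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, filename)
    if not os.path.exists(path):
        fetch(path)  # cache miss: fetch once, reuse on later calls
    yield path


calls = []


def fake_fetch(path):
    # Stand-in for a real download (assumption): writes one made-up rating row.
    calls.append(path)
    with open(path, "w") as f:
        f.write("1\t10\t5\t881250949\n")


with tempfile.TemporaryDirectory() as cache:
    for _ in range(2):
        with cached_file("ratings.tsv", fake_fetch, cache_dir=cache) as p:
            with open(p) as f:
                first_line = f.readline()

# The second pass found the file in the cache, so fake_fetch ran only once.
fetches_with_cache = len(calls)
```

Using a context manager keeps the temporary-directory case tidy: the file exists while the caller parses it and disappears afterwards, which matches the cleanup behavior documented for local_cache_path=None.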
Usage
Import and call this function at the start of a recommender system experiment pipeline when you need MovieLens data in a pandas DataFrame. Use the size parameter to select the dataset scale, and pass title_col, genres_col, or year_col to include movie metadata in the output.
Code Reference
Source Location
- Repository: recommenders
- File: recommenders/datasets/movielens.py
- Lines: L152-L251
Signature
def load_pandas_df(
    size="100k",
    header=None,
    local_cache_path=None,
    title_col=None,
    genres_col=None,
    year_col=None,
) -> pd.DataFrame
Import
from recommenders.datasets.movielens import load_pandas_df
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| size | str | No (default: "100k") | Size of the MovieLens dataset to load. One of "100k", "1m", "10m", "20m", "mock100". |
| header | list or tuple or None | No (default: None) | Column names for the rating data. If None, uses DEFAULT_HEADER (userID, itemID, rating, timestamp). Truncated to 4 elements if longer. |
| local_cache_path | str or None | No (default: None) | Directory or zip file path for caching the downloaded archive. If None, uses a temporary directory that is cleaned up after use. |
| title_col | str or None | No (default: None) | Column name for the movie title. If None, title is not loaded. |
| genres_col | str or None | No (default: None) | Column name for movie genres (pipe-separated string). If None, genres are not loaded. |
| year_col | str or None | No (default: None) | Column name for movie release year. If None, year is not loaded. Ignored for mock data. |
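The header rules in the table (fall back to the default names when None, keep only the first four names when longer) can be illustrated as follows. `resolve_header` is a hypothetical helper written for this page, not a function exported by the library, and it covers only the two rules stated above.

```python
# Default column names as listed in the Inputs table.
DEFAULT_HEADER = ("userID", "itemID", "rating", "timestamp")


def resolve_header(header=None):
    """Hypothetical helper mirroring the documented `header` rules."""
    if header is None:
        return list(DEFAULT_HEADER)
    # Only the first four names are used: user, item, rating, timestamp.
    return list(header)[:4]


default_names = resolve_header()
custom_names = resolve_header(("UserId", "ItemId", "Rating", "Timestamp", "Extra"))
```

Here `custom_names` drops the fifth entry, so extra names passed by mistake do not end up as column labels.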
Outputs
| Name | Type | Description |
|---|---|---|
| return | pd.DataFrame | DataFrame containing the user, item, rating, and timestamp columns, plus any requested metadata columns (title, genres, year). The rating column is cast to float. |
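To make the shape of the output concrete, the sketch below builds two toy frames and combines them the way the contract describes: ratings cast to float, metadata attached per item. The data values and the `GenreList` column are invented for illustration, and the merge shown here is an assumption about how such a join can be done in pandas, not a copy of the library's implementation.

```python
import pandas as pd

# Toy stand-ins for parsed rating and item data (all values are made up).
ratings = pd.DataFrame(
    {
        "userID": [1, 1, 2],
        "itemID": [10, 20, 10],
        "rating": [5, 3, 4],
        "timestamp": [881250949, 881250950, 881250951],
    }
)
items = pd.DataFrame(
    {
        "itemID": [10, 20],
        "Title": ["Movie A (1995)", "Movie B (1996)"],
        "Genres": ["Animation|Comedy", "Action|Adventure|Thriller"],
    }
)

# Cast ratings to float, as the output contract states.
ratings["rating"] = ratings["rating"].astype(float)

# Attach metadata to every rating row by item ID, keeping the rating order.
joined = ratings.merge(items, on="itemID", how="left")

# Genres arrive as one pipe-separated string; split into a list when needed.
joined["GenreList"] = joined["Genres"].map(lambda g: g.split("|"))
```

Each rating row now carries its movie's title and genres, so movies rated by several users repeat their metadata once per rating.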
Usage Examples
Basic Usage
from recommenders.datasets.movielens import load_pandas_df
# Load MovieLens 100K with default columns (userID, itemID, rating, timestamp)
df = load_pandas_df("100k")
# Load MovieLens 1M with custom column names
df = load_pandas_df("1m", header=["UserId", "ItemId", "Rating", "Timestamp"])
# Load with movie metadata
df = load_pandas_df(
    "1m",
    header=["UserId", "ItemId", "Rating", "Timestamp"],
    title_col="Title",
    genres_col="Genres",
    year_col="Year",
)
# Load mock data for testing
df = load_pandas_df("mock100")
Dependencies
- pandas - DataFrame construction and CSV parsing
- pandera - Schema validation for mock data
- zipfile - Archive extraction
- os / tempfile - File path management and temporary directories