Implementation:Fastai Fastbook Collab Untar Data
| Knowledge Sources | |
|---|---|
| Domains | Recommender Systems, Data Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool, built on fastai and pandas, for downloading, extracting, and parsing the MovieLens collaborative filtering dataset.
Description
This implementation uses untar_data from fastai.data.external to download and cache the MovieLens 100K dataset, then uses pd.read_csv to parse the tab-delimited ratings file and the pipe-delimited movie metadata file. The two DataFrames are merged on the movie ID column to produce a single enriched ratings DataFrame containing user IDs, movie IDs, ratings, timestamps, and human-readable movie titles.
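The parse-and-merge flow described here can be tried without downloading anything. The sketch below is a minimal illustration, assuming tiny in-memory stand-ins for `u.data` and `u.item`. The sample text is invented for the example (the extra `x` fields only mimic the presence of unused metadata columns), and only the delimiters and column names come from this page.

```python
import io
import pandas as pd

# Invented miniature stand-ins for the real files (assumption, not real data):
# u.data is tab-delimited, u.item is pipe-delimited with extra unused columns.
ratings_text = "196\t242\t3\t881250949\n63\t242\t3\t875747190\n5\t1\t4\t875635748\n"
movies_text = "1|Toy Story (1995)|x|x\n242|Kolya (1996)|x|x\n"

# Ratings: no header row, so column names are supplied explicitly.
ratings = pd.read_csv(io.StringIO(ratings_text), delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

# Movies: keep only the first two columns (ID and title).
movies = pd.read_csv(io.StringIO(movies_text), delimiter='|', header=None,
                     usecols=[0, 1], names=['movie', 'title'])

# With no key given, merge joins on the column both frames share: 'movie'.
merged = ratings.merge(movies)
```

Each rating row picks up the title of its movie, so `merged` has the four rating columns plus `title`.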
Usage
Import and run these functions at the start of any collaborative filtering notebook or script. The untar_data call is idempotent: if the dataset has already been downloaded and extracted, it returns the cached path immediately without re-downloading.
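The idempotent behaviour described above follows a common check-before-download pattern. The sketch below illustrates that pattern only; it is not fastai's implementation, and `fetch_once` and `fake_download` are hypothetical names introduced for the example.

```python
import tempfile
from pathlib import Path

def fetch_once(dest: Path, fetch) -> Path:
    """Return dest, calling fetch(dest) only when dest does not exist yet."""
    if not dest.exists():
        fetch(dest)
    return dest

calls = []  # records how many times the "download" actually ran

def fake_download(dest: Path) -> None:
    # Hypothetical stand-in for a real download-and-extract step.
    calls.append(dest)
    dest.mkdir(parents=True)
    (dest / 'u.data').write_text('196\t242\t3\t881250949\n')

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'ml-100k'
    first = fetch_once(target, fake_download)   # performs the fetch
    second = fetch_once(target, fake_download)  # finds the cached directory
    same_path = first == second
    download_count = len(calls)
```

Both calls return the same path, and the download step runs only once.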
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/08_collab.md (Lines 24-122)
Signature
# Dataset download and extraction
untar_data(url: str) -> Path
# Ratings file parsing
pd.read_csv(
filepath_or_buffer,
delimiter: str = '\t',
header: int = None,
names: list = ['user', 'movie', 'rating', 'timestamp']
) -> pd.DataFrame
# Movie metadata parsing
pd.read_csv(
filepath_or_buffer,
delimiter: str = '|',
encoding: str = 'latin-1',
usecols: tuple = (0, 1),
names: tuple = ('movie', 'title'),
header: int = None
) -> pd.DataFrame
# Merge ratings with movie titles
ratings.merge(movies) -> pd.DataFrame
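Because `ratings.merge(movies)` names no key, pandas joins on the columns the two frames have in common, which here is only `movie`, using an inner join. The sketch below, on a small hand-made sample, shows an equivalent call that spells out the key and asks pandas to verify that each movie ID appears at most once in `movies`. The explicit form is a suggestion, not what the source notebook uses.

```python
import pandas as pd

# Small hand-made sample frames (assumption: shapes mirror the real data).
ratings = pd.DataFrame({
    'user': [196, 63],
    'movie': [242, 242],
    'rating': [3, 3],
    'timestamp': [881250949, 875747190],
})
movies = pd.DataFrame({
    'movie': [1, 242],
    'title': ['Toy Story (1995)', 'Kolya (1996)'],
})

# Implicit form used on this page: join on the shared column, inner join.
implicit = ratings.merge(movies)

# Explicit form: same result, but the key is visible and pandas raises
# MergeError if a movie ID is duplicated on the right-hand side.
explicit = ratings.merge(movies, on='movie', how='inner', validate='many_to_one')
```

Spelling out `on='movie'` guards against an accidental join on extra columns if either frame later gains a column with a matching name.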
Import
from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL constant for the dataset; use URLs.ML_100k for MovieLens 100K |
| path/'u.data' | Path | Yes | Tab-delimited ratings file within the extracted archive |
| path/'u.item' | Path | Yes | Pipe-delimited movie metadata file within the extracted archive |
Outputs
| Name | Type | Description |
|---|---|---|
| path | Path | Local filesystem path to the extracted dataset directory |
| ratings | pd.DataFrame | DataFrame with columns: user (int), movie (int), rating (int), timestamp (int) |
| movies | pd.DataFrame | DataFrame with columns: movie (int), title (str) |
| ratings (merged) | pd.DataFrame | DataFrame with columns: user, movie, rating, timestamp, title after merge |
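The output contract above can be checked programmatically before the merged frame is passed on. The following is a minimal sketch; `check_merged_ratings` is a hypothetical helper, not part of fastai or pandas, and the sample frames are hand-made.

```python
import pandas as pd

EXPECTED_COLUMNS = ['user', 'movie', 'rating', 'timestamp', 'title']
INTEGER_COLUMNS = ['user', 'movie', 'rating', 'timestamp']

def check_merged_ratings(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame matches the contract."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    for col in INTEGER_COLUMNS:
        if col in df.columns and not pd.api.types.is_integer_dtype(df[col]):
            problems.append(f"column {col!r} is not an integer dtype")
    if 'title' in df.columns and df['title'].isna().any():
        problems.append("some ratings have no title")
    return problems

# Hand-made sample that satisfies the contract.
good = pd.DataFrame({
    'user': [196, 63],
    'movie': [242, 242],
    'rating': [3, 3],
    'timestamp': [881250949, 875747190],
    'title': ['Kolya (1996)', 'Kolya (1996)'],
})
# Deliberately broken variant: no title column, float ratings.
bad = good.drop(columns=['title']).astype({'rating': 'float64'})

good_problems = check_merged_ratings(good)
bad_problems = check_merged_ratings(bad)
```

Returning a list of messages rather than raising on the first failure makes it easier to see every mismatch in one pass.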
Usage Examples
Basic Usage
from fastai.collab import *
from fastai.tabular.all import *
# Step 1: Download and extract the MovieLens 100K dataset
path = untar_data(URLs.ML_100k)
# Step 2: Parse the ratings file (tab-delimited, no header)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
# Output:
# user movie rating timestamp
# 0 196 242 3 881250949
# 1 186 302 3 891717742
# 2 22 377 1 878887116
# 3 244 51 2 880606923
# 4 166 346 1 886397596
# Step 3: Parse the movie metadata file (pipe-delimited, latin-1 encoding)
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()
# Output:
# movie title
# 0 1 Toy Story (1995)
# 1 2 GoldenEye (1995)
# 2 3 Four Rooms (1995)
# 3 4 Get Shorty (1995)
# 4 5 Copycat (1995)
# Step 4: Merge ratings with movie titles
ratings = ratings.merge(movies)
ratings.head()
# Output:
# user movie rating timestamp title
# 0 196 242 3 881250949 Kolya (1996)
# 1 63 242 3 875747190 Kolya (1996)
# 2 226 242 5 883888671 Kolya (1996)