Implementation:Fastai Fastbook Collab Untar Data

Knowledge Sources	fastbook, fastai docs
Domains	Recommender Systems, Data Engineering
Last Updated	2026-02-09 17:00 GMT

Overview

Concrete tool, built on fastai and pandas, for downloading, extracting, and parsing the MovieLens collaborative filtering dataset.

Description

This implementation uses untar_data from fastai.data.external to download and cache the MovieLens 100K dataset, then uses pd.read_csv to parse the tab-delimited ratings file (u.data) and the first two columns of the pipe-delimited movie metadata file (u.item). Because movie is the only column the two DataFrames share, ratings.merge(movies) joins on it, producing a single enriched ratings DataFrame containing user IDs, movie IDs, ratings, timestamps, and human-readable movie titles.
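
The parse-and-merge flow described above can be tried offline before downloading anything. The following is a minimal sketch that uses made-up sample rows held in memory (they are not real MovieLens records) in place of u.data and u.item; the extra third field in the movie rows is an assumption added only to show that usecols discards trailing columns.

```python
import io
import pandas as pd

# Made-up sample rows standing in for u.data and u.item (not real MovieLens data).
ratings_text = "1\t10\t4\t100\n2\t10\t5\t101\n3\t20\t2\t102\n"
movies_text = "10|Film A (2000)|01-Jan-2000\n20|Film B (2001)|01-Jan-2001\n"

# Tab-delimited, no header row, so column names are supplied explicitly.
ratings = pd.read_csv(io.StringIO(ratings_text), delimiter="\t", header=None,
                      names=["user", "movie", "rating", "timestamp"])

# usecols=(0, 1) keeps only the ID and title and ignores the trailing field.
movies = pd.read_csv(io.StringIO(movies_text), delimiter="|", header=None,
                     usecols=(0, 1), names=("movie", "title"))

# With no `on` argument, merge joins on the columns both frames share: "movie".
merged = ratings.merge(movies)
print(merged)
```

The same calls apply to the real files once path points at the extracted archive; only the file arguments and the latin-1 encoding for u.item differ.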

Usage

Run these calls at the start of any collaborative filtering notebook or script. The untar_data call is idempotent: if the dataset has already been downloaded and extracted, it returns the cached path without downloading again.
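
The caching behaviour described above follows a check-before-download pattern. The sketch below illustrates that pattern only, using the standard library; fetch_once and fake_download are hypothetical names, and this is not fastai's implementation of untar_data.

```python
import tempfile
from pathlib import Path


def fetch_once(dest: Path, download) -> Path:
    """Hypothetical helper: call `download` only when `dest` is missing."""
    if not dest.exists():
        download(dest)
    return dest


calls = []


def fake_download(dest: Path) -> None:
    # Stand-in for a real download-and-extract step.
    calls.append(dest)
    dest.mkdir(parents=True)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "ml-100k"
    first = fetch_once(target, fake_download)
    second = fetch_once(target, fake_download)  # served from the "cache"
    same_path = first == second
    download_count = len(calls)

print(same_path, download_count)
```

Calling the helper twice returns the same path while the download step runs once, which is the property that makes a repeated call safe at the top of a notebook.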

Code Reference

Source Location

Repository: fastbook
File: translations/cn/08_collab.md (Lines 24-122)

Signature

# Dataset download and extraction
untar_data(url: str) -> Path

# Ratings file parsing (values shown are the arguments passed here, not pandas defaults)
pd.read_csv(
    filepath_or_buffer,
    delimiter: str = '\t',
    header: int = None,
    names: list = ['user', 'movie', 'rating', 'timestamp']
) -> pd.DataFrame

# Movie metadata parsing (values shown are the arguments passed here, not pandas defaults)
pd.read_csv(
    filepath_or_buffer,
    delimiter: str = '|',
    encoding: str = 'latin-1',
    usecols: tuple = (0, 1),
    names: tuple = ('movie', 'title'),
    header: int = None
) -> pd.DataFrame

# Merge ratings with movie titles
ratings.merge(movies) -> pd.DataFrame

Import

from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

I/O Contract

Inputs

Name	Type	Required	Description
url	str	Yes	URL constant for the dataset; use `URLs.ML_100k` for MovieLens 100K
path/'u.data'	Path	Yes	Tab-delimited ratings file within the extracted archive
path/'u.item'	Path	Yes	Pipe-delimited movie metadata file within the extracted archive

Outputs

Name	Type	Description
path	Path	Local filesystem path to the extracted dataset directory
ratings	pd.DataFrame	DataFrame with columns: user (int), movie (int), rating (int), timestamp (int)
movies	pd.DataFrame	DataFrame with columns: movie (int), title (str)
ratings (merged)	pd.DataFrame	DataFrame with columns: user, movie, rating, timestamp, title after merge
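
The output contract above can be checked programmatically before training. The following is a minimal sketch, assuming the column names and integer types listed in the table; check_ratings_contract is a hypothetical helper, not part of fastai or pandas, and the sample frames are made up.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def check_ratings_contract(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of contract violations (empty if none)."""
    problems = []
    for col in ("user", "movie", "rating", "timestamp"):
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not is_integer_dtype(df[col]):
            problems.append(f"non-integer column: {col}")
    return problems


# Made-up frames: one that satisfies the contract and one that does not.
good = pd.DataFrame({"user": [1], "movie": [10], "rating": [4], "timestamp": [100]})
bad = pd.DataFrame({"user": [1], "movie": [10], "rating": [4.5]})

print(check_ratings_contract(good))
print(check_ratings_contract(bad))
```

Returning a list of problems rather than raising on the first one is a design choice for this sketch; it lets a notebook report every mismatch in one pass.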

Usage Examples

Basic Usage

from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

# Step 1: Download and extract the MovieLens 100K dataset
path = untar_data(URLs.ML_100k)

# Step 2: Parse the ratings file (tab-delimited, no header)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
# Output:
#    user  movie  rating  timestamp
# 0   196    242       3  881250949
# 1   186    302       3  891717742
# 2    22    377       1  878887116
# 3   244     51       2  880606923
# 4   166    346       1  886397596

# Step 3: Parse the movie metadata file (pipe-delimited, latin-1 encoding)
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()
# Output:
#    movie              title
# 0      1   Toy Story (1995)
# 1      2   GoldenEye (1995)
# 2      3  Four Rooms (1995)
# 3      4  Get Shorty (1995)
# 4      5     Copycat (1995)

# Step 4: Merge ratings with movie titles
ratings = ratings.merge(movies)
ratings.head()
# Output:
#    user  movie  rating  timestamp             title
# 0   196    242       3  881250949   Kolya (1996)
# 1    63    242       3  875747190   Kolya (1996)
# 2   226    242       5  883888671   Kolya (1996)
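
Because merge defaults to an inner join, a rating whose movie ID has no metadata row would be dropped silently in Step 4. A more defensive variant is sketched below on made-up sample frames (not MovieLens data); it names the join key explicitly and uses the pandas validate and indicator options to surface duplicate or unmatched IDs. Whether the real dataset needs this is not something the source states.

```python
import pandas as pd

# Made-up sample frames; movie 30 deliberately has no metadata row.
ratings = pd.DataFrame({"user": [1, 2, 3], "movie": [10, 10, 30], "rating": [4, 5, 2]})
movies = pd.DataFrame({"movie": [10, 20], "title": ["Film A (2000)", "Film B (2001)"]})

# validate="many_to_one" raises MergeError if a movie ID appears twice in `movies`.
# how="left" with indicator=True keeps every rating and flags unmatched ones.
checked = ratings.merge(movies, on="movie", how="left",
                        validate="many_to_one", indicator=True)

# Rows marked "left_only" are ratings without a matching title.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["user", "movie"]])
```

If unmatched is empty, the plain ratings.merge(movies) call yields the same rows as this left join.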

Related Pages

Implements Principle

Principle:Fastai_Fastbook_Collab_Data_Loading

Requires Environment

Environment:Fastai_Fastbook_Python_FastAI_Environment
