Implementation:Fastai Fastbook Collab Untar Data



Knowledge Sources
Domains Recommender Systems, Data Engineering
Last Updated 2026-02-09 17:00 GMT

Overview

Concrete tool, built on fastai and pandas, for downloading, extracting, and parsing the MovieLens 100K collaborative filtering dataset.

Description

This implementation uses untar_data from fastai.data.external to download and cache the MovieLens 100K dataset, then uses pd.read_csv to parse the tab-delimited ratings file (u.data) and the pipe-delimited movie metadata file (u.item). Only the first two columns of u.item, the movie ID and the title, are read. The two DataFrames are then merged on their shared movie column to produce a single enriched ratings DataFrame containing user IDs, movie IDs, ratings, timestamps, and human-readable movie titles.
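As a minimal sketch of the parse-and-merge flow described above, the snippet below uses two tiny in-memory strings in place of u.data and u.item, so it runs without downloading anything. The names ratings_txt and items_txt, the last ratings row, and the placeholder extra columns are assumptions made for illustration; they are not the real file contents.

```python
import io
import pandas as pd

# Stand-ins for u.data (tab-delimited) and u.item (pipe-delimited).
# The third ratings row and the "extra|cols" fields are made up for this sketch.
ratings_txt = "196\t242\t3\t881250949\n63\t242\t3\t875747190\n7\t1\t5\t880000000\n"
items_txt = "1|Toy Story (1995)|extra|cols\n242|Kolya (1996)|extra|cols\n"

# No header row in either file, so column names are supplied explicitly.
ratings = pd.read_csv(io.StringIO(ratings_txt), delimiter='\t', header=None,
                      names=['user', 'movie', 'rating', 'timestamp'])

# Keep only the first two columns (ID and title); the real file also needs
# encoding='latin-1', which is irrelevant for an in-memory string.
movies = pd.read_csv(io.StringIO(items_txt), delimiter='|', header=None,
                     usecols=[0, 1], names=['movie', 'title'])

# 'movie' is the only column the two frames share, so it becomes the join key.
merged = ratings.merge(movies)
print(merged)
```

Every rating row gains a title column, and the movie column is the only link between the two inputs.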

Usage

Import and run these functions at the start of any collaborative filtering notebook or script. The untar_data call is idempotent: if the dataset has already been downloaded and extracted, it returns the cached path immediately without re-downloading.
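To illustrate what idempotent means here, the following standard-library sketch extracts an archive only when the destination directory does not exist yet. It is not fastai's implementation of untar_data; extract_once is a hypothetical helper, and the archive is a one-line sample built on the fly.

```python
import tarfile
import tempfile
from pathlib import Path

def extract_once(archive: Path, dest: Path) -> Path:
    """Hypothetical helper: extract `archive` into `dest` unless `dest` already exists."""
    if not dest.exists():
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    return dest

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Build a tiny sample archive containing a single ratings line.
    src = root / "u.data"
    src.write_text("196\t242\t3\t881250949\n")
    archive = root / "sample.tgz"
    with tarfile.open(archive, "w:gz") as tf:
        tf.add(src, arcname="u.data")

    dest = root / "extracted"
    first = extract_once(archive, dest)

    # Delete the archive: the second call can only succeed by reusing the cache.
    archive.unlink()
    second = extract_once(archive, dest)

    same_path = first == second
    cached_file_exists = (second / "u.data").exists()
```

The second call returns the same path without touching the (now deleted) archive, which is the behaviour the paragraph above attributes to untar_data.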

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/08_collab.md (Lines 24-122)

Signature

# Dataset download and extraction
untar_data(url: str) -> Path

# Ratings file parsing (values shown are the arguments used here, not pandas defaults)
pd.read_csv(
    filepath_or_buffer,
    delimiter: str = '\t',
    header: int = None,
    names: list = ['user', 'movie', 'rating', 'timestamp']
) -> pd.DataFrame

# Movie metadata parsing (values shown are the arguments used here, not pandas defaults)
pd.read_csv(
    filepath_or_buffer,
    delimiter: str = '|',
    encoding: str = 'latin-1',
    usecols: tuple = (0, 1),
    names: tuple = ('movie', 'title'),
    header: int = None
) -> pd.DataFrame

# Merge ratings with movie titles
ratings.merge(movies) -> pd.DataFrame

Import

from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd

I/O Contract

Inputs

Name          | Type | Required | Description
url           | str  | Yes      | URL constant for the dataset; use URLs.ML_100k for MovieLens 100K
path/'u.data' | Path | Yes      | Tab-delimited ratings file within the extracted archive
path/'u.item' | Path | Yes      | Pipe-delimited movie metadata file within the extracted archive

Outputs

Name             | Type         | Description
path             | Path         | Local filesystem path to the extracted dataset directory
ratings          | pd.DataFrame | DataFrame with columns: user (int), movie (int), rating (int), timestamp (int)
movies           | pd.DataFrame | DataFrame with columns: movie (int), title (str)
ratings (merged) | pd.DataFrame | DataFrame with columns: user, movie, rating, timestamp, title after merge
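The output contract above can be checked programmatically. The sketch below is one possible way to do that; check_merged_ratings and EXPECTED_COLUMNS are hypothetical names introduced for illustration, and the sample rows are taken from the merged output shown on this page.

```python
import pandas as pd

EXPECTED_COLUMNS = ['user', 'movie', 'rating', 'timestamp', 'title']

def check_merged_ratings(df: pd.DataFrame) -> None:
    """Hypothetical helper: raise ValueError if `df` deviates from the documented output."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # The four numeric columns are documented as integers.
    for col in ['user', 'movie', 'rating', 'timestamp']:
        if not pd.api.types.is_integer_dtype(df[col]):
            raise ValueError(f"column {col!r} should be integer, got {df[col].dtype}")
    # Every rating should have picked up a title during the merge.
    if df['title'].isna().any():
        raise ValueError("some ratings have no title")

sample = pd.DataFrame({
    'user': [196, 63],
    'movie': [242, 242],
    'rating': [3, 3],
    'timestamp': [881250949, 875747190],
    'title': ['Kolya (1996)', 'Kolya (1996)'],
})
check_merged_ratings(sample)  # passes silently when the contract holds
```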

Usage Examples

Basic Usage

from fastai.collab import *
from fastai.tabular.all import *
import pandas as pd  # explicit import so the pd name does not depend on the star imports

# Step 1: Download and extract the MovieLens 100K dataset
path = untar_data(URLs.ML_100k)

# Step 2: Parse the ratings file (tab-delimited, no header)
ratings = pd.read_csv(path/'u.data', delimiter='\t', header=None,
                      names=['user','movie','rating','timestamp'])
ratings.head()
# Output:
#    user  movie  rating  timestamp
# 0   196    242       3  881250949
# 1   186    302       3  891717742
# 2    22    377       1  878887116
# 3   244     51       2  880606923
# 4   166    346       1  886397596

# Step 3: Parse the movie metadata file (pipe-delimited, latin-1 encoding)
movies = pd.read_csv(path/'u.item', delimiter='|', encoding='latin-1',
                     usecols=(0,1), names=('movie','title'), header=None)
movies.head()
# Output:
#    movie              title
# 0      1   Toy Story (1995)
# 1      2   GoldenEye (1995)
# 2      3  Four Rooms (1995)
# 3      4  Get Shorty (1995)
# 4      5     Copycat (1995)

# Step 4: Merge ratings with movie titles
ratings = ratings.merge(movies)
ratings.head()
# Output:
#    user  movie  rating  timestamp             title
# 0   196    242       3  881250949   Kolya (1996)
# 1    63    242       3  875747190   Kolya (1996)
# 2   226    242       5  883888671   Kolya (1996)
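Called with no arguments, ratings.merge(movies) performs an inner join on the columns the two DataFrames share, which here is only movie. The sketch below spells those choices out and adds two optional pandas safeguards: validate to assert that each movie ID appears once in the metadata, and indicator to flag ratings with no matching title. The in-memory rows, including movie ID 9999, are made up for illustration.

```python
import pandas as pd

ratings = pd.DataFrame({
    'user': [196, 63, 7],
    'movie': [242, 242, 9999],  # 9999 is a made-up ID with no metadata entry
    'rating': [3, 3, 4],
    'timestamp': [881250949, 875747190, 880000000],
})
movies = pd.DataFrame({
    'movie': [1, 242],
    'title': ['Toy Story (1995)', 'Kolya (1996)'],
})

# how='left' keeps every rating; validate raises MergeError if a movie ID
# is duplicated in `movies`; indicator adds a '_merge' column per row.
merged = ratings.merge(movies, on='movie', how='left',
                       validate='many_to_one', indicator=True)

# Ratings whose movie ID was not found in the metadata.
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched[['user', 'movie']])
```

With an inner join those unmatched rows would be dropped silently, so a left join with indicator is a convenient way to confirm that nothing was lost before switching back to the default.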
