Implementation:Fastai Fastbook Untar Data Text
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete utilities, provided by the fastai library, for downloading, extracting, and enumerating text datasets.
Description
The fastai library provides two key utilities for text data preparation:
- untar_data: Downloads a dataset from a URL, caches it locally, and extracts the archive. Returns a Path object pointing to the extracted directory. The function uses fastai's built-in URL registry (URLs) which contains constants for many standard NLP datasets including IMDb, Wikitext-103, and AG News.
- get_text_files: Recursively scans specified folders within a path and returns all files with text extensions (.txt) as an L list (fastai's enhanced list type). The folders parameter allows filtering to specific subdirectories.
Together, these two functions handle the complete data acquisition pipeline from URL to enumerated file list.
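To make the pipeline concrete without depending on fastai, here is a standard-library sketch of the cache-then-extract behaviour that `untar_data` provides. This is an illustration, not fastai's implementation: the helper name `fetch_and_extract`, the cache location, and the naming scheme for the extraction directory are all assumptions.

```python
# Illustrative sketch only -- NOT fastai's implementation. It mimics the
# download-once, cache, and extract behaviour of untar_data using only
# the standard library. Names and cache layout are assumptions.
import tarfile
import urllib.request
from pathlib import Path

def fetch_and_extract(url: str, cache_dir: str = "~/.cache/datasets") -> Path:
    """Download `url` once, cache the archive, extract it, return the directory."""
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    archive = cache / Path(url).name                 # e.g. imdb.tgz
    dest = cache / archive.name.split(".")[0]        # e.g. imdb/
    if not archive.exists():                         # already cached? skip download
        urllib.request.urlretrieve(url, archive)
    if not dest.exists():                            # already extracted? skip extraction
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    return dest
```

Because both the download and the extraction are guarded by existence checks, repeated calls are cheap, which is the property that makes `untar_data` safe to leave at the top of a notebook.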
Usage
Import and use these functions at the start of any fastai NLP workflow. They are typically the first lines of code in a notebook or script before any tokenization or model-building steps.
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 99-115)
- Library module: fastai.data.external (untar_data, URLs), fastai.data.transforms (get_text_files)
Signature
def untar_data(
    url: str,
    fname: str = None,
    dest: str = None,
    c_key: str = 'data',
    force_download: bool = False,
    extract_func: callable = file_extract,
    timeout: int = 4
) -> Path

def get_text_files(
    path: Path,
    recurse: bool = True,
    folders: list = None
) -> L
Import
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_text_files
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL of the dataset archive (untar_data). Use URLs.IMDB for the IMDb sentiment dataset. |
| fname | str | No | Override the local filename for the downloaded archive (untar_data). |
| dest | str | No | Override the destination directory for extraction (untar_data). |
| force_download | bool | No | If True, re-download even if cached (untar_data). Default: False. |
| path | Path | Yes | Root path to search for text files (get_text_files). |
| recurse | bool | No | Whether to descend into subdirectories (get_text_files). Default: True. |
| folders | list | No | List of top-level subfolder names to restrict the search, e.g. ['train', 'test'] (get_text_files). |
Outputs
| Name | Type | Description |
|---|---|---|
| path | Path | Path object pointing to the root of the extracted dataset directory. |
| files | L | An L list of Path objects, one per text file found in the specified folders. |
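The enumeration side of this contract can be sketched with only `pathlib` and `os`. This is a behavioural approximation, not fastai's code; the helper name `list_text_files` is an assumption. One detail worth illustrating is that the `folders` filter restricts which top-level subdirectories are entered, while the walk below them remains fully recursive.

```python
# Behavioural sketch of get_text_files using only the standard library.
# An approximation for illustration, not fastai's actual implementation.
import os
from pathlib import Path

def list_text_files(path, recurse=True, folders=None, extensions=(".txt",)):
    """Return text files under `path`; `folders` restricts which top-level
    subdirectories are entered (mirroring the fastai parameter)."""
    path = Path(path)
    if not recurse:
        return sorted(p for p in path.iterdir()
                      if p.is_file() and p.suffix.lower() in extensions)
    results = []
    for i, (root, dirs, files) in enumerate(os.walk(path)):
        if folders and i == 0:
            dirs[:] = [d for d in dirs if d in folders]  # top-level filter only
            continue                                      # skip files at the root
        results += [Path(root) / f for f in files
                    if Path(f).suffix.lower() in extensions]
    return results
```

A returned list of plain `Path` objects stands in for fastai's `L`, which behaves like a Python list with extra conveniences (such as the `(#n) [...]` display).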
Usage Examples
Basic Usage
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_text_files
# Download and extract the IMDb dataset
path = untar_data(URLs.IMDB)
print(path)
# Output: /home/user/.fastai/data/imdb
# List all directories
print(path.ls())
# Output: (#7) [Path('imdb.vocab'),Path('test'),Path('train'),Path('README'),...]
Enumerating Text Files
# Get all text files from the train, test, and unsup folders
# (in fastai's extraction of URLs.IMDB, unsup is a top-level folder)
files = get_text_files(path, folders=['train', 'test', 'unsup'])
print(len(files))
# Output: 100000 (25000 train labeled + 25000 test labeled + 50000 unsupervised)
# Inspect a single file
txt = files[0].open().read()
print(txt[:200])
# Output: First 200 characters of the review text
Getting Only Training Files
# Get only the labeled training files (unsup is a separate top-level folder)
train_files = get_text_files(path / 'train')
print(len(train_files))
# Output: 25000
# Get only labeled positive training reviews
pos_files = get_text_files(path / 'train' / 'pos')
print(len(pos_files))
# Output: 12500
Verifying Directory Structure
# Confirm expected directory structure
print((path / 'train').ls())
# Output: (#2) [Path('neg'),Path('pos')]
print((path / 'test').ls())
# Output: (#2) [Path('neg'),Path('pos')]
# Read a single review
review = (path / 'train' / 'pos' / '0_9.txt').read_text()
print(review[:100])
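Since the IMDb layout encodes the sentiment label in the directory name, a natural next step after enumeration is to pair each review with its label. The sketch below uses only `pathlib`; the helper names `label_from_path` and `load_labelled` are assumptions for illustration, not fastai API.

```python
# Sketch: derive sentiment labels from the parent folder name, since the
# IMDb layout stores labeled reviews under pos/ and neg/ directories.
# Helper names are illustrative, not part of fastai.
from pathlib import Path

def label_from_path(file: Path) -> str:
    """Map .../train/pos/0_9.txt -> 'pos' and .../train/neg/3_1.txt -> 'neg'."""
    return file.parent.name

def load_labelled(files):
    """Return (text, label) pairs for a list of review files."""
    return [(f.read_text(encoding="utf-8"), label_from_path(f)) for f in files]
```

In a real fastai workflow this labeling is usually delegated to the data block API, but the underlying rule is exactly this parent-folder lookup.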
Related Pages
Implements Principle
Requires Environment