
Implementation:Fastai Fastbook Untar Data Text

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Data Engineering
Last Updated 2026-02-09 17:00 GMT

Overview

A concrete tool, provided by the fastai library, for downloading, extracting, and enumerating text datasets.

Description

The fastai library provides two key utilities for text data preparation:

  • untar_data: Downloads a dataset from a URL, caches it locally, and extracts the archive. Returns a Path object pointing to the extracted directory. The function uses fastai's built-in URL registry (URLs) which contains constants for many standard NLP datasets including IMDb, Wikitext-103, and AG News.
  • get_text_files: Recursively scans specified folders within a path and returns all files with text extensions (.txt) as an L list (fastai's enhanced list type). The folders parameter allows filtering to specific subdirectories.

Together, these two functions handle the complete data acquisition pipeline from URL to enumerated file list.
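The download-cache-extract behavior of untar_data can be sketched in plain Python. This is a hedged, minimal analogue for illustration only: the helper name fetch_and_extract, the cache layout, and the ".tgz" suffix handling are assumptions, not fastai's actual implementation.

```python
import tarfile
import urllib.request
from pathlib import Path

def fetch_and_extract(url: str, cache_dir: Path, force_download: bool = False) -> Path:
    """Minimal analogue of untar_data: download once, cache, extract, return a Path."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    archive = cache_dir / Path(url).name              # local filename derived from the URL
    dest = cache_dir / archive.name.removesuffix(".tgz")
    if force_download or not archive.exists():
        urllib.request.urlretrieve(url, archive)      # download only when not cached
    if not dest.exists():
        with tarfile.open(archive) as tf:
            tf.extractall(dest)                       # extract alongside the archive
    return dest
```

A second call with the same URL returns immediately from the cache, which is the property that makes untar_data safe to leave at the top of a notebook.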

Usage

Import and use these functions at the start of any fastai NLP workflow. They are typically the first lines of code in a notebook or script before any tokenization or model-building steps.

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/10_nlp.md (lines 99-115)
  • Library module: fastai.data.external (untar_data, URLs), fastai.data.transforms (get_text_files)

Signature

def untar_data(
    url: str,
    fname: str = None,
    dest: str = None,
    c_key: str = 'data',
    force_download: bool = False,
    extract_func: callable = file_extract,
    timeout: int = 4
) -> Path

def get_text_files(
    path: Path,
    recurse: bool = True,
    folders: list = None
) -> L

Import

from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_text_files

I/O Contract

Inputs

Name Type Required Description
url str Yes URL of the dataset archive. Use URLs.IMDB for the IMDb sentiment dataset.
fname str No Override the local filename for the downloaded archive.
dest str No Override the destination directory for extraction.
force_download bool No If True, re-download even if cached. Default: False.
path Path Yes Root path to search for text files (used by get_text_files).
folders list No List of subfolder names to restrict the search. E.g., ['train', 'test'].

Outputs

Name Type Description
path Path Path object pointing to the root of the extracted dataset directory.
files L An L list of Path objects, one per text file found in the specified folders.

Usage Examples

Basic Usage

from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_text_files

# Download and extract the IMDb dataset
path = untar_data(URLs.IMDB)
print(path)
# Output: /home/user/.fastai/data/imdb

# List all directories
print(path.ls())
# Output: (#7) [Path('imdb.vocab'),Path('test'),Path('train'),Path('README'),...]

Enumerating Text Files

# Get all text files from the train, test, and unsup folders
files = get_text_files(path, folders=['train', 'test', 'unsup'])
print(len(files))
# Output: 100000 (25000 train labeled + 25000 test labeled + 50000 unsupervised)

# Inspect a single file
txt = files[0].open().read()
print(txt[:200])
# Output: First 200 characters of the review text
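After enumeration, the class label is conventionally the name of each file's parent folder (pos or neg in IMDb). A hedged pathlib sketch of that step, with a hypothetical helper name:

```python
from collections import Counter
from pathlib import Path

def label_counts(files: list[Path]) -> Counter:
    """Count files per class, taking the immediate parent folder name as the label."""
    return Counter(f.parent.name for f in files)
```

In fastai itself this folder-name convention is what labelling functions such as parent_label rely on.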

Getting Only Training Files

# Get only labeled training text files (unsup is a separate top-level folder)
train_files = get_text_files(path / 'train')
print(len(train_files))
# Output: 25000

# Get only labeled positive training reviews
pos_files = get_text_files(path / 'train' / 'pos')
print(len(pos_files))
# Output: 12500

Verifying Directory Structure

# Confirm expected directory structure
print((path / 'train').ls())
# Output: (#2) [Path('neg'),Path('pos')]

print((path / 'test').ls())
# Output: (#2) [Path('neg'),Path('pos')]

# Read a single review
review = (path / 'train' / 'pos' / '0_9.txt').read_text()
print(review[:100])
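Structure checks like the ones above can be automated before any training step. A minimal sketch, assuming the fastai IMDb layout (labeled train/test splits plus a top-level unsup folder); the helper name check_imdb_layout is illustrative:

```python
from pathlib import Path

def check_imdb_layout(path: Path) -> list[str]:
    """Return the expected subdirectories missing from an IMDb-style tree."""
    expected = ["train/pos", "train/neg", "test/pos", "test/neg", "unsup"]
    return [sub for sub in expected if not (path / sub).is_dir()]
```

An empty return value means the extraction produced the layout the rest of the pipeline assumes; anything else names exactly what is missing.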

Related Pages

Implements Principle

Requires Environment
