Implementation:Fastai Fastbook Untar Data Text
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete utilities, provided by the fastai library, for downloading, extracting, and enumerating text datasets.
Description
The fastai library provides two key utilities for text data preparation:
- untar_data: Downloads a dataset from a URL, caches it locally, and extracts the archive. Returns a Path object pointing to the extracted directory. The function uses fastai's built-in URL registry (URLs) which contains constants for many standard NLP datasets including IMDb, Wikitext-103, and AG News.
- get_text_files: Recursively scans specified folders within a path and returns all files with text extensions (.txt) as an L list (fastai's enhanced list type). The folders parameter allows filtering to specific subdirectories.
Together, these two functions handle the complete data acquisition pipeline from URL to enumerated file list.
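To make the pipeline concrete without depending on fastai, here is a standard-library sketch of the cache-then-extract behaviour that `untar_data` provides. This is an illustration, not fastai's implementation: the helper name `fetch_and_extract`, the cache location, and the naming scheme for the extraction directory are all assumptions.

```python
# Illustrative sketch only -- NOT fastai's implementation. It mimics the
# download-once, cache, and extract behaviour of untar_data using only
# the standard library. Names and cache layout are assumptions.
import tarfile
import urllib.request
from pathlib import Path

def fetch_and_extract(url: str, cache_dir: str = "~/.cache/datasets") -> Path:
    """Download `url` once, cache the archive, extract it, return the directory."""
    cache = Path(cache_dir).expanduser()
    cache.mkdir(parents=True, exist_ok=True)
    archive = cache / Path(url).name                 # e.g. imdb.tgz
    dest = cache / archive.name.split(".")[0]        # e.g. imdb/
    if not archive.exists():                         # already cached? skip download
        urllib.request.urlretrieve(url, archive)
    if not dest.exists():                            # already extracted? skip extraction
        with tarfile.open(archive) as tf:
            tf.extractall(dest)
    return dest
```

Because both the download and the extraction are guarded by existence checks, repeated calls are cheap, which is the property that makes `untar_data` safe to leave at the top of a notebook.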
Usage
Import and use these functions at the start of any fastai NLP workflow. They are typically the first lines of code in a notebook or script before any tokenization or model-building steps.
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 99-115)
- Library module: fastai.data.external (untar_data, URLs), fastai.data.transforms (get_text_files)
Signature
def untar_data(
    url: str,
    fname: str = None,
    dest: str = None,
    c_key: str = 'data',
    force_download: bool = False,
    extract_func: callable = file_extract,
    timeout: int = 4
) -> Path

def get_text_files(
    path: Path,
    recurse: bool = True,
    folders: list = None
) -> L
Import
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_text_files
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| url | str | Yes | URL of the dataset archive (untar_data). Use URLs.IMDB for the IMDb sentiment dataset. |
| fname | str | No | Override the local filename for the downloaded archive (untar_data). |
| dest | str | No | Override the destination directory for extraction (untar_data). |
| force_download | bool | No | If True, re-download even if cached (untar_data). Default: False. |
| path | Path | Yes | Root path to search for text files (get_text_files). |
| recurse | bool | No | Whether to descend into subdirectories (get_text_files). Default: True. |
| folders | list | No | List of top-level subfolder names to restrict the search, e.g. ['train', 'test'] (get_text_files). |
Outputs
| Name | Type | Description |
|---|---|---|
| path | Path | Path object pointing to the root of the extracted dataset directory. |
| files | L | An L list of Path objects, one per text file found in the specified folders. |
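The enumeration side of this contract can be sketched with only `pathlib` and `os`. This is a behavioural approximation, not fastai's code; the helper name `list_text_files` is an assumption. One detail worth illustrating is that the `folders` filter restricts which top-level subdirectories are entered, while the walk below them remains fully recursive.

```python
# Behavioural sketch of get_text_files using only the standard library.
# An approximation for illustration, not fastai's actual implementation.
import os
from pathlib import Path

def list_text_files(path, recurse=True, folders=None, extensions=(".txt",)):
    """Return text files under `path`; `folders` restricts which top-level
    subdirectories are entered (mirroring the fastai parameter)."""
    path = Path(path)
    if not recurse:
        return sorted(p for p in path.iterdir()
                      if p.is_file() and p.suffix.lower() in extensions)
    results = []
    for i, (root, dirs, files) in enumerate(os.walk(path)):
        if folders and i == 0:
            dirs[:] = [d for d in dirs if d in folders]  # top-level filter only
            continue                                      # skip files at the root
        results += [Path(root) / f for f in files
                    if Path(f).suffix.lower() in extensions]
    return results
```

A returned list of plain `Path` objects stands in for fastai's `L`, which behaves like a Python list with extra conveniences (such as the `(#n) [...]` display).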
Usage Examples
Basic Usage
from fastai.data.external import untar_data, URLs
from fastai.data.transforms import get_text_files
# Download and extract the IMDb dataset
path = untar_data(URLs.IMDB)
print(path)
# Output: /home/user/.fastai/data/imdb
# List all directories
print(path.ls())
# Output: (#7) [Path('imdb.vocab'),Path('test'),Path('train'),Path('README'),...]
Enumerating Text Files
# Get all text files from the train, test, and unsup folders
# (in fastai's extraction of URLs.IMDB, unsup is a top-level folder)
files = get_text_files(path, folders=['train', 'test', 'unsup'])
print(len(files))
# Output: 100000 (25000 train labeled + 25000 test labeled + 50000 unsupervised)
# Inspect a single file
txt = files[0].open().read()
print(txt[:200])
# Output: First 200 characters of the review text
Getting Only Training Files
# Get only the labeled training files (unsup is a separate top-level folder)
train_files = get_text_files(path / 'train')
print(len(train_files))
# Output: 25000
# Get only labeled positive training reviews
pos_files = get_text_files(path / 'train' / 'pos')
print(len(pos_files))
# Output: 12500
Verifying Directory Structure
# Confirm expected directory structure
print((path / 'train').ls())
# Output: (#2) [Path('neg'),Path('pos')]
print((path / 'test').ls())
# Output: (#2) [Path('neg'),Path('pos')]
# Read a single review
review = (path / 'train' / 'pos' / '0_9.txt').read_text()
print(review[:100])
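Since the IMDb layout encodes the sentiment label in the directory name, a natural next step after enumeration is to pair each review with its label. The sketch below uses only `pathlib`; the helper names `label_from_path` and `load_labelled` are assumptions for illustration, not fastai API.

```python
# Sketch: derive sentiment labels from the parent folder name, since the
# IMDb layout stores labeled reviews under pos/ and neg/ directories.
# Helper names are illustrative, not part of fastai.
from pathlib import Path

def label_from_path(file: Path) -> str:
    """Map .../train/pos/0_9.txt -> 'pos' and .../train/neg/3_1.txt -> 'neg'."""
    return file.parent.name

def load_labelled(files):
    """Return (text, label) pairs for a list of review files."""
    return [(f.read_text(encoding="utf-8"), label_from_path(f)) for f in files]
```

In a real fastai workflow this labeling is usually delegated to the data block API, but the underlying rule is exactly this parent-folder lookup.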
Related Pages
Implements Principle
Requires Environment