Environment:Fastai Fastbook NLP SpaCy Environment

From Leeroopedia


Knowledge Sources

Domains: NLP, Tokenization
Last Updated: 2026-02-09 17:00 GMT

Overview

The spaCy and SentencePiece tokenization environment required for the NLP text-classification and language-model fine-tuning chapters.

Description

The NLP chapters (Ch10, Ch12) use fastai's text processing pipeline, which relies on spaCy for word-level tokenization and SentencePiece for subword tokenization. spaCy provides the base tokenizer that splits text into words while handling special cases (contractions, punctuation). SentencePiece (listed in `requirements.txt`) enables subword tokenization using the Byte Pair Encoding (BPE) or Unigram algorithms. The fastai `Tokenizer` class wraps these backends and adds special tokens (e.g., `xxbos`, `xxmaj`, `xxunk`).
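The special-token scheme can be illustrated with a minimal sketch in pure Python. The token names (`xxbos`, `xxmaj`) follow fastai's conventions, and the output matches the tokenized excerpt quoted later on this page, but the rules below are a simplified demonstration, not fastai's actual implementation.

```python
# Illustrative sketch of fastai-style special-token preprocessing.
# Mimics two of fastai's text transforms: prepending a beginning-of-stream
# token (xxbos) and replacing a capitalized word with its lowercase form
# preceded by a capitalization marker (xxmaj).
# This is NOT fastai's implementation, just a minimal demonstration.

BOS = "xxbos"   # beginning-of-stream marker
MAJ = "xxmaj"   # "next word was capitalized" marker

def add_special_tokens(words):
    """Prepend xxbos and mark capitalized words with xxmaj."""
    out = [BOS]
    for w in words:
        if w[:1].isupper():
            out.append(MAJ)
            out.append(w.lower())
        else:
            out.append(w)
    return out

tokens = add_special_tokens(["It", "'s", "awesome", "!", "In", "Story", "Mode"])
print(" ".join(tokens))
# xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode
```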

Usage

Use this environment for the NLP Text Classification workflow, specifically:

  • Tokenization: Word-level splitting with spaCy, subword splitting with SentencePiece
  • Language model data: Preparing text corpora for LM fine-tuning
  • Text classifier: Building classifiers on top of fine-tuned language models
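The subword splitting mentioned above rests on Byte Pair Encoding. The following pure-Python sketch shows the core BPE idea (repeatedly merging the most frequent adjacent symbol pair); SentencePiece's real BPE mode operates on raw text with learned vocabularies and is considerably more sophisticated, so treat this only as an illustration.

```python
# Minimal Byte Pair Encoding (BPE) sketch: repeatedly merge the most
# frequent adjacent symbol pair across a toy corpus. This illustrates
# the idea behind SentencePiece's BPE mode; it is not SentencePiece.
from collections import Counter

def merge_word(symbols, pair):
    """Apply one merge rule to a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])  # fuse the pair
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(words, num_merges):
    """Learn merge rules from a list of words, each split into characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym in corpus:
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair
        merges.append(best)
        corpus = [merge_word(sym, best) for sym in corpus]
    return merges, corpus

merges, corpus = learn_merges(["lower", "lowest", "low"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
print(corpus)   # [['lo w', 'e', 'r'], ...] with 'low' fused in each word
```

After two merges the shared stem "low" has been fused into a single subword unit, which is exactly why subword tokenization handles rare inflections well.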

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Any (Linux, macOS, Windows) | No platform restrictions |
| Disk | 1 GB+ free space | For spaCy language models |

Dependencies

Python Packages

  • `spacy` (installed as dependency of fastai)
  • `sentencepiece` (listed in requirements.txt)
  • `fastai` >= 2.0.0 (provides `Tokenizer` wrapper)

Language Models

  • spaCy English model: `en_core_web_sm` (downloaded separately)

Credentials

No credentials required.

Quick Install

pip install spacy sentencepiece

# Download spaCy English model
python -m spacy download en_core_web_sm

Code Evidence

SentencePiece in requirements from `requirements.txt:9`:

sentencepiece

Language model learner with dropout from `10_nlp.md:551-553`:

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

Tokenization output showing special tokens from `10_nlp.md:540-542`:

xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode ...

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `OSError: Can't find model 'en_core_web_sm'` | spaCy model not downloaded | `python -m spacy download en_core_web_sm` |
| `ModuleNotFoundError: No module named 'sentencepiece'` | SentencePiece not installed | `pip install sentencepiece` |
| Tokenization extremely slow | Full spaCy pipeline (tagger, parser, NER) running when only the tokenizer is needed | Disable unneeded pipeline components, e.g. `spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])` |

Compatibility Notes

  • Language models: Different spaCy language models are needed for non-English text. The Fastbook uses English text (IMDb reviews).
  • SentencePiece vs spaCy: fastai supports both tokenization backends. The default `SpacyTokenizer` uses spaCy; `SentencePieceTokenizer` is an alternative for subword-level processing.
  • Preprocessed data: fastai caches tokenized data, so spaCy is primarily needed during the first tokenization pass.
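The caching point above can be sketched in pure Python: keying stored token lists by a content hash means the expensive tokenization pass runs only once per corpus. The function and cache layout below are hypothetical illustrations of the general idea, not fastai's actual caching mechanism.

```python
# Illustrative sketch of caching tokenization results, the general idea
# behind fastai only tokenizing a corpus once. The cache layout and the
# function name are hypothetical, not fastai's actual mechanism.
import hashlib
import json
from pathlib import Path

def cached_tokenize(text, tokenize, cache_dir=Path("tok_cache")):
    """Tokenize `text`, reusing a cached result keyed by a content hash."""
    cache_dir.mkdir(exist_ok=True)
    key = hashlib.sha1(text.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():                      # cache hit: skip tokenization
        return json.loads(cache_file.read_text())
    tokens = tokenize(text)                      # cache miss: do the work once
    cache_file.write_text(json.dumps(tokens))
    return tokens

# Usage with a trivial whitespace tokenizer standing in for spaCy:
tokens = cached_tokenize("the movie was great", str.split)
print(tokens)  # ['the', 'movie', 'was', 'great']
```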
