Environment:Fastai Fastbook NLP SpaCy Environment
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
This environment provides spaCy and SentencePiece tokenization support, required for the NLP text classification and language-model fine-tuning chapters.
Description
The NLP chapters (Ch10, Ch12) use fastai's text processing pipeline, which relies on spaCy for word-level tokenization and SentencePiece for subword tokenization. spaCy provides the base tokenizer that splits text into words while handling special cases (contractions, punctuation). SentencePiece (listed in `requirements.txt`) enables subword tokenization using the Byte Pair Encoding (BPE) or Unigram algorithms. The fastai `Tokenizer` class wraps these backends and adds special tokens (e.g., `xxbos`, `xxmaj`, `xxunk`).
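The special-token behavior can be illustrated with a small pure-Python sketch. This is a simplified approximation of fastai's rules, not the library's actual implementation, and the helper name `apply_special_tokens` is made up for illustration:

```python
# Simplified sketch of fastai-style special tokens (NOT the real fastai code):
# xxbos marks the beginning of a text stream, xxmaj records that the next
# word was capitalized, and xxunk replaces out-of-vocabulary words.
def apply_special_tokens(words, vocab):
    out = ["xxbos"]
    for w in words:
        if w[:1].isupper():
            out.append("xxmaj")          # record capitalization, then lowercase
            w = w.lower()
        out.append(w if w in vocab else "xxunk")
    return out

vocab = {"it", "was", "awesome", "!"}
print(apply_special_tokens(["It", "was", "awesome", "!"], vocab))
# ['xxbos', 'xxmaj', 'it', 'was', 'awesome', '!']
```

Lowercasing plus an `xxmaj` marker lets the model share one embedding per word while still seeing capitalization as a signal.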
Usage
Use this environment for the NLP Text Classification workflow, specifically:
- Tokenization: Word-level splitting with spaCy, subword splitting with SentencePiece
- Language model data: Preparing text corpora for LM fine-tuning
- Text classifier: Building classifiers on top of fine-tuned language models
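To make the subword idea concrete, here is a minimal BPE-style merge step in pure Python. This is an illustrative toy, not SentencePiece's actual algorithm or API:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most common."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Start from characters and apply one merge; real BPE repeats this until
# the target vocabulary size is reached.
words = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(words)
words = merge_pair(words, pair)
print(pair, words)
```

Repeating the merge loop grows frequent character sequences into whole subwords, which is why subword tokenizers handle rare and unseen words without an `xxunk`-style fallback.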
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Any (Linux, macOS, Windows) | No platform restrictions |
| Disk | 1GB+ free space | For spaCy language models |
Dependencies
Python Packages
- `spacy` (installed as a dependency of fastai)
- `sentencepiece` (listed in requirements.txt)
- `fastai` >= 2.0.0 (provides `Tokenizer` wrapper)
Language Models
- spaCy English model: `en_core_web_sm` (downloaded separately)
Credentials
No credentials required.
Quick Install
pip install spacy sentencepiece
# Download spaCy English model
python -m spacy download en_core_web_sm
Code Evidence
SentencePiece in requirements from `requirements.txt:9`:
sentencepiece
Language model learner with dropout from `10_nlp.md:551-553`:
learn = language_model_learner(
dls_lm, AWD_LSTM, drop_mult=0.3,
metrics=[accuracy, Perplexity()]).to_fp16()
Tokenization output showing special tokens from `10_nlp.md:540-542`:
xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode ...
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `OSError: Can't find model 'en_core_web_sm'` | spaCy model not downloaded | `python -m spacy download en_core_web_sm` |
| `ModuleNotFoundError: No module named 'sentencepiece'` | SentencePiece not installed | `pip install sentencepiece` |
| Tokenization extremely slow | Full spaCy pipeline (tagger, parser, NER) running instead of just the tokenizer | Tokenize with a blank pipeline (e.g. `spacy.blank("en")`) or disable unused components |
Compatibility Notes
- Language models: Non-English text requires the corresponding spaCy language model. The Fastbook examples use English text (IMDb reviews).
- SentencePiece vs spaCy: fastai supports both tokenization backends. The default `SpacyTokenizer` uses spaCy; `SentencePieceTokenizer` is an alternative for subword-level processing.
- Preprocessed data: fastai caches tokenized data, so spaCy is primarily needed during the first tokenization pass.
Related Pages
- Implementation:Fastai_Fastbook_Tokenizer
- Implementation:Fastai_Fastbook_Numericalize
- Implementation:Fastai_Fastbook_LMDataLoader
- Implementation:Fastai_Fastbook_Language_Model_Learner
- Implementation:Fastai_Fastbook_Text_Classifier_DataLoaders
- Implementation:Fastai_Fastbook_Text_Classifier_Learner
- Implementation:Fastai_Fastbook_Untar_Data_Text