Environment:Fastai Fastbook NLP SpaCy Environment

From Leeroopedia


Knowledge Sources

Domains: NLP, Tokenization
Last Updated: 2026-02-09 17:00 GMT

Overview

The spaCy and SentencePiece tokenization environment required for the NLP text-classification and language-model fine-tuning chapters.

Description

The NLP chapters (Ch10, Ch12) use fastai's text processing pipeline, which relies on spaCy for word-level tokenization and SentencePiece for subword tokenization. spaCy provides the base tokenizer that splits text into words while handling special cases (contractions, punctuation). SentencePiece (listed in `requirements.txt`) enables subword tokenization using the Byte Pair Encoding (BPE) or Unigram algorithms. The fastai `Tokenizer` class wraps these backends and adds special tokens (e.g., `xxbos`, `xxmaj`, `xxunk`).
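The special-token scheme can be illustrated with a minimal sketch in pure Python. The token names (`xxbos`, `xxmaj`) follow fastai's conventions, and the output matches the tokenized excerpt quoted later on this page, but the rules below are a simplified demonstration, not fastai's actual implementation.

```python
# Illustrative sketch of fastai-style special-token preprocessing.
# Mimics two of fastai's text transforms: prepending a beginning-of-stream
# token (xxbos) and replacing a capitalized word with its lowercase form
# preceded by a capitalization marker (xxmaj).
# This is NOT fastai's implementation, just a minimal demonstration.

BOS = "xxbos"   # beginning-of-stream marker
MAJ = "xxmaj"   # "next word was capitalized" marker

def add_special_tokens(words):
    """Prepend xxbos and mark capitalized words with xxmaj."""
    out = [BOS]
    for w in words:
        if w[:1].isupper():
            out.append(MAJ)
            out.append(w.lower())
        else:
            out.append(w)
    return out

tokens = add_special_tokens(["It", "'s", "awesome", "!", "In", "Story", "Mode"])
print(" ".join(tokens))
# xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode
```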

Usage

Use this environment for the NLP Text Classification workflow, specifically:

  • Tokenization: Word-level splitting with spaCy, subword splitting with SentencePiece
  • Language model data: Preparing text corpora for LM fine-tuning
  • Text classifier: Building classifiers on top of fine-tuned language models
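The subword splitting mentioned above rests on Byte Pair Encoding. The following pure-Python sketch shows the core BPE idea (repeatedly merging the most frequent adjacent symbol pair); SentencePiece's real BPE mode operates on raw text with learned vocabularies and is considerably more sophisticated, so treat this only as an illustration.

```python
# Minimal Byte Pair Encoding (BPE) sketch: repeatedly merge the most
# frequent adjacent symbol pair across a toy corpus. This illustrates
# the idea behind SentencePiece's BPE mode; it is not SentencePiece.
from collections import Counter

def merge_word(symbols, pair):
    """Apply one merge rule to a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])  # fuse the pair
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_merges(words, num_merges):
    """Learn merge rules from a list of words, each split into characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for sym in corpus:
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair
        merges.append(best)
        corpus = [merge_word(sym, best) for sym in corpus]
    return merges, corpus

merges, corpus = learn_merges(["lower", "lowest", "low"], num_merges=2)
print(merges)   # [('l', 'o'), ('lo', 'w')]
print(corpus)   # [['lo w', 'e', 'r'], ...] with 'low' fused in each word
```

After two merges the shared stem "low" has been fused into a single subword unit, which is exactly why subword tokenization handles rare inflections well.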

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Any (Linux, macOS, Windows) | No platform restrictions |
| Disk | 1 GB+ free space | For spaCy language models |

Dependencies

Python Packages

  • `spacy` (installed as dependency of fastai)
  • `sentencepiece` (listed in requirements.txt)
  • `fastai` >= 2.0.0 (provides `Tokenizer` wrapper)

Language Models

  • spaCy English model: `en_core_web_sm` (downloaded separately)

Credentials

No credentials required.

Quick Install

pip install spacy sentencepiece

# Download spaCy English model
python -m spacy download en_core_web_sm

Code Evidence

SentencePiece in requirements from `requirements.txt:9`:

sentencepiece

Language model learner with dropout from `10_nlp.md:551-553`:

learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

Tokenization output showing special tokens from `10_nlp.md:540-542`:

xxbos xxmaj it 's awesome ! xxmaj in xxmaj story xxmaj mode ...

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `OSError: Can't find model 'en_core_web_sm'` | spaCy model not downloaded | `python -m spacy download en_core_web_sm` |
| `ModuleNotFoundError: No module named 'sentencepiece'` | SentencePiece not installed | `pip install sentencepiece` |
| Tokenization extremely slow | Full spaCy pipeline (tagger, parser, NER) running when only the tokenizer is needed | Disable unneeded pipeline components, e.g. `spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])` |

Compatibility Notes

  • Language models: Different spaCy language models are needed for non-English text. The Fastbook uses English text (IMDb reviews).
  • SentencePiece vs spaCy: fastai supports both tokenization backends. The default `SpacyTokenizer` uses spaCy; `SentencePieceTokenizer` is an alternative for subword-level processing.
  • Preprocessed data: fastai caches tokenized data, so spaCy is primarily needed during the first tokenization pass.
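The caching point above can be sketched in pure Python: keying stored token lists by a content hash means the expensive tokenization pass runs only once per corpus. The function and cache layout below are hypothetical illustrations of the general idea, not fastai's actual caching mechanism.

```python
# Illustrative sketch of caching tokenization results, the general idea
# behind fastai only tokenizing a corpus once. The cache layout and the
# function name are hypothetical, not fastai's actual mechanism.
import hashlib
import json
from pathlib import Path

def cached_tokenize(text, tokenize, cache_dir=Path("tok_cache")):
    """Tokenize `text`, reusing a cached result keyed by a content hash."""
    cache_dir.mkdir(exist_ok=True)
    key = hashlib.sha1(text.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():                      # cache hit: skip tokenization
        return json.loads(cache_file.read_text())
    tokens = tokenize(text)                      # cache miss: do the work once
    cache_file.write_text(json.dumps(tokens))
    return tokens

# Usage with a trivial whitespace tokenizer standing in for spaCy:
tokens = cached_tokenize("the movie was great", str.split)
print(tokens)  # ['the', 'movie', 'was', 'great']
```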
