Principle: Fastai Fastbook Tokenization
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Text Preprocessing, Computational Linguistics |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Tokenization is the process of splitting raw text into a sequence of discrete units (tokens) that serve as the atomic input elements for language models and text classifiers.
Description
Tokenization bridges the gap between raw human-readable text and the numerical representations required by neural networks. It operates in two conceptual stages:
- Base tokenization: Splitting the raw character stream into word-level or subword-level units using linguistic rules, statistical models, or both.
- Token augmentation: Inserting special tokens that encode document boundaries, capitalization patterns, repetition, and other meta-information that would otherwise be lost during the lowercasing and normalization steps.
There are three major families of tokenization:
- Word-level tokenization: Splits text on whitespace and punctuation boundaries. Simple and interpretable, but produces large vocabularies and cannot handle out-of-vocabulary (OOV) words gracefully.
- Subword tokenization: Algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece split rare words into frequently occurring subword units. This balances vocabulary size against sequence length.
- Character-level tokenization: Each character becomes a token. Produces the smallest vocabulary but the longest sequences, making training slow and long-range dependencies harder to learn.
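The vocabulary-size versus sequence-length trade-off can be seen on a toy corpus. This is an illustrative sketch only (plain whitespace splitting stands in for a real word tokenizer):

```python
# Contrast word-level vs character-level tokenization on a toy corpus.
corpus = "the movie was great the acting was great"

word_tokens = corpus.split()   # word-level: split on whitespace
char_tokens = list(corpus)     # character-level: one token per character

# Word-level: short sequence (8 tokens), vocabulary grows with the corpus.
print(len(word_tokens), len(set(word_tokens)))
# Character-level: tiny closed vocabulary, but a much longer sequence.
print(len(char_tokens), len(set(char_tokens)))
```

Subword methods such as BPE sit between these two extremes: frequent words stay whole while rare words decompose into shorter, reusable pieces.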
The ULMFiT approach (Howard & Ruder, 2018) uses word-level tokenization via spaCy as the base tokenizer, augmented with a set of special tokens that preserve information typically lost during preprocessing.
Usage
Tokenization is required in every NLP pipeline between raw text input and numericalization. Use word-level tokenization when:
- Working with the ULMFiT / AWD-LSTM pipeline where the pretrained model expects word tokens.
- The target language has reliable word boundary detection (e.g., English, most European languages).
- You need interpretable tokens for debugging and analysis.
Theoretical Basis
Base Tokenization Algorithm
Word tokenization relies on a linguistic model (typically spaCy) that applies the following rules:
FUNCTION word_tokenize(text):
    doc = spacy_model(text)
    tokens = []
    FOR EACH token IN doc:
        tokens.append(token.text)
    RETURN tokens
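A minimal runnable stand-in for the base tokenizer can be written with a regular expression. Note this is a simplification for illustration, not spaCy's behavior: a real linguistic model also handles contractions ("don't" → "do", "n't"), abbreviations, URLs, and so on.

```python
import re

def word_tokenize(text):
    """Simplified word tokenizer: runs of word characters, or single
    punctuation marks, each become one token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("This movie was AMAZING!!!"))
```

Each `!` comes out as its own token here, which is exactly the information loss the xxrep rule below is designed to compress and preserve.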
Special Token Rules
After base tokenization, the fastai tokenizer applies a series of rules that insert special tokens. These rules preserve information that would otherwise be destroyed by lowercasing:
| Special Token | Meaning | Rule |
|---|---|---|
| xxbos | Beginning of stream | Inserted at the start of every document. Signals to the model that a new text begins. |
| xxmaj | Next word is capitalized | Inserted before any word that begins with an uppercase letter (fully uppercase words are handled by xxup instead). The word itself is then lowercased. |
| xxup | Next word is all uppercase | Inserted before any word that is entirely uppercase (e.g., "AMAZING"). The word is then lowercased. |
| xxrep | Character repetition | Inserted when a single character is repeated three or more times (e.g., "!!!!!" becomes "xxrep 5 !"). |
| xxwrep | Word repetition | Inserted when a word is repeated three or more times consecutively (e.g., "very very very" becomes "xxwrep 3 very"). |
| xxunk | Unknown token | Used during numericalization for tokens not found in the vocabulary. |
| xxpad | Padding | Used to pad sequences to equal length within a batch. |
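As a concrete illustration, the xxwrep rule can be sketched with a back-referencing regular expression. This is an assumed simplification operating on raw text, not fastai's actual implementation:

```python
import re

def replace_wrep(text):
    """Sketch of the xxwrep rule: collapse a word repeated three or
    more times in a row into 'xxwrep <count> <word>'."""
    def _sub(m):
        word = m.group(1)
        count = len(m.group(0).split())  # how many copies matched
        return f" xxwrep {count} {word} "
    # \1 back-references the captured word; {2,} requires 2+ extra copies
    return re.sub(r"\b(\w+)(?:\s+\1\b){2,}", _sub, text)

print(replace_wrep("it was very very very good"))
```

The xxrep rule works the same way at the character level, matching a single non-space character repeated three or more times.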
Tokenization Pipeline
The full tokenization pipeline can be expressed as:
FUNCTION full_tokenize(text):
    # Step 1: Apply pre-rules (cleanup)
    text = fix_html_entities(text)
    text = replace_special_characters(text)
    # Step 2: Base tokenization
    tokens = word_tokenize(text)
    # Step 3: Apply post-rules (special tokens)
    tokens = insert_bos(tokens)            # Add xxbos at start
    tokens = handle_capitalization(tokens) # xxmaj, xxup
    tokens = handle_repetitions(tokens)    # xxrep, xxwrep
    tokens = lowercase_all(tokens)         # Lowercase everything
    RETURN tokens
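The steps above can be condensed into a minimal, self-contained sketch. This is not fastai's implementation (which composes these rules through its Tokenizer class): it omits the pre-cleanup and word-repetition rules, uses a regex stand-in for spaCy, and applies the character-repetition rule on the raw text before tokenization so the count stays in one place:

```python
import re

def replace_rep(text):
    """xxrep rule: a character repeated three or more times becomes
    'xxrep <count> <char>' (applied on raw text)."""
    def _sub(m):
        return f" xxrep {len(m.group(0))} {m.group(1)} "
    return re.sub(r"(\S)\1{2,}", _sub, text)

def full_tokenize(text):
    text = replace_rep(text)                    # repetition rule
    tokens = re.findall(r"\w+|[^\w\s]", text)   # base tokenization (stand-in)
    out = ["xxbos"]                             # beginning-of-stream token
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():      # xxup: fully uppercase word
            out += ["xxup", tok.lower()]
        elif tok[:1].isupper():                 # xxmaj: capitalized word
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok.lower())             # lowercase everything else
    return out

print(full_tokenize("This movie was AMAZING!!!"))
```

On the worked example from the next subsection, this sketch reproduces the token sequence shown there.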
Why Special Tokens Matter
Consider the sentence: "This movie was AMAZING!!!"
Without special tokens, after lowercasing: ["this", "movie", "was", "amazing", "!", "!", "!"]
With special tokens: ["xxbos", "xxmaj", "this", "movie", "was", "xxup", "amazing", "xxrep", "3", "!"]
The special-token version preserves the emphasis conveyed by capitalization and punctuation repetition, which is critical for sentiment analysis tasks. The model can learn that xxup followed by a word often indicates strong sentiment.