Principle: Fastai Fastbook Tokenization
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Text Preprocessing, Computational Linguistics |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Tokenization is the process of splitting raw text into a sequence of discrete units (tokens) that serve as the atomic input elements for language models and text classifiers.
Description
Tokenization bridges the gap between raw human-readable text and the numerical representations required by neural networks. It operates in two conceptual stages:
- Base tokenization: Splitting the raw character stream into word-level or subword-level units using linguistic rules, statistical models, or both.
- Token augmentation: Inserting special tokens that encode document boundaries, capitalization patterns, repetition, and other meta-information that would otherwise be lost during the lowercasing and normalization steps.
There are three major families of tokenization:
- Word-level tokenization: Splits text on whitespace and punctuation boundaries. Simple and interpretable, but produces large vocabularies and cannot handle out-of-vocabulary (OOV) words gracefully.
- Subword tokenization: Algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece split rare words into frequently occurring subword units. This balances vocabulary size against sequence length.
- Character-level tokenization: Each character becomes a token. Produces the smallest vocabulary but the longest sequences, making training slow and long-range dependencies harder to learn.
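The vocabulary-size versus sequence-length trade-off can be seen on a toy corpus. This is an illustrative sketch only (plain whitespace splitting stands in for a real word tokenizer):

```python
# Contrast word-level vs character-level tokenization on a toy corpus.
corpus = "the movie was great the acting was great"

word_tokens = corpus.split()   # word-level: split on whitespace
char_tokens = list(corpus)     # character-level: one token per character

# Word-level: short sequence (8 tokens), vocabulary grows with the corpus.
print(len(word_tokens), len(set(word_tokens)))
# Character-level: tiny closed vocabulary, but a much longer sequence.
print(len(char_tokens), len(set(char_tokens)))
```

Subword methods such as BPE sit between these two extremes: frequent words stay whole while rare words decompose into shorter, reusable pieces.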
The ULMFiT approach (Howard & Ruder, 2018) uses word-level tokenization via spaCy as the base tokenizer, augmented with a set of special tokens that preserve information typically lost during preprocessing.
Usage
Tokenization is required in every NLP pipeline between raw text input and numericalization. Use word-level tokenization when:
- Working with the ULMFiT / AWD-LSTM pipeline where the pretrained model expects word tokens.
- The target language has reliable word boundary detection (e.g., English, most European languages).
- You need interpretable tokens for debugging and analysis.
Theoretical Basis
Base Tokenization Algorithm
Word tokenization relies on a linguistic model (typically spaCy) that applies the following rules:
FUNCTION word_tokenize(text):
    doc = spacy_model(text)
    tokens = []
    FOR EACH token IN doc:
        tokens.append(token.text)
    RETURN tokens
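A minimal runnable stand-in for the base tokenizer can be written with a regular expression. Note this is a simplification for illustration, not spaCy's behavior: a real linguistic model also handles contractions ("don't" → "do", "n't"), abbreviations, URLs, and so on.

```python
import re

def word_tokenize(text):
    """Simplified word tokenizer: runs of word characters, or single
    punctuation marks, each become one token."""
    return re.findall(r"\w+|[^\w\s]", text)

print(word_tokenize("This movie was AMAZING!!!"))
```

Each `!` comes out as its own token here, which is exactly the information loss the xxrep rule below is designed to compress and preserve.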
Special Token Rules
After base tokenization, the fastai tokenizer applies a series of rules that insert special tokens. These rules preserve information that would otherwise be destroyed by lowercasing:
| Special Token | Meaning | Rule |
|---|---|---|
| xxbos | Beginning of stream | Inserted at the start of every document. Signals to the model that a new text begins. |
| xxmaj | Next word is capitalized | Inserted before any word that begins with an uppercase letter (fully uppercase words are handled by xxup instead). The word itself is then lowercased. |
| xxup | Next word is all uppercase | Inserted before any word that is entirely uppercase (e.g., "AMAZING"). The word is then lowercased. |
| xxrep | Character repetition | Inserted when a single character is repeated three or more times (e.g., "!!!!!" becomes "xxrep 5 !"). |
| xxwrep | Word repetition | Inserted when a word is repeated three or more times consecutively (e.g., "very very very" becomes "xxwrep 3 very"). |
| xxunk | Unknown token | Used during numericalization for tokens not found in the vocabulary. |
| xxpad | Padding | Used to pad sequences to equal length within a batch. |
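As a concrete illustration, the xxwrep rule can be sketched with a back-referencing regular expression. This is an assumed simplification operating on raw text, not fastai's actual implementation:

```python
import re

def replace_wrep(text):
    """Sketch of the xxwrep rule: collapse a word repeated three or
    more times in a row into 'xxwrep <count> <word>'."""
    def _sub(m):
        word = m.group(1)
        count = len(m.group(0).split())  # how many copies matched
        return f" xxwrep {count} {word} "
    # \1 back-references the captured word; {2,} requires 2+ extra copies
    return re.sub(r"\b(\w+)(?:\s+\1\b){2,}", _sub, text)

print(replace_wrep("it was very very very good"))
```

The xxrep rule works the same way at the character level, matching a single non-space character repeated three or more times.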
Tokenization Pipeline
The full tokenization pipeline can be expressed as:
FUNCTION full_tokenize(text):
    # Step 1: Apply pre-rules (cleanup)
    text = fix_html_entities(text)
    text = replace_special_characters(text)
    # Step 2: Base tokenization
    tokens = word_tokenize(text)
    # Step 3: Apply post-rules (special tokens)
    tokens = insert_bos(tokens)            # Add xxbos at start
    tokens = handle_capitalization(tokens) # xxmaj, xxup
    tokens = handle_repetitions(tokens)    # xxrep, xxwrep
    tokens = lowercase_all(tokens)         # Lowercase everything
    RETURN tokens
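The steps above can be condensed into a minimal, self-contained sketch. This is not fastai's implementation (which composes these rules through its Tokenizer class): it omits the pre-cleanup and word-repetition rules, uses a regex stand-in for spaCy, and applies the character-repetition rule on the raw text before tokenization so the count stays in one place:

```python
import re

def replace_rep(text):
    """xxrep rule: a character repeated three or more times becomes
    'xxrep <count> <char>' (applied on raw text)."""
    def _sub(m):
        return f" xxrep {len(m.group(0))} {m.group(1)} "
    return re.sub(r"(\S)\1{2,}", _sub, text)

def full_tokenize(text):
    text = replace_rep(text)                    # repetition rule
    tokens = re.findall(r"\w+|[^\w\s]", text)   # base tokenization (stand-in)
    out = ["xxbos"]                             # beginning-of-stream token
    for tok in tokens:
        if len(tok) > 1 and tok.isupper():      # xxup: fully uppercase word
            out += ["xxup", tok.lower()]
        elif tok[:1].isupper():                 # xxmaj: capitalized word
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok.lower())             # lowercase everything else
    return out

print(full_tokenize("This movie was AMAZING!!!"))
```

On the worked example from the next subsection, this sketch reproduces the token sequence shown there.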
Why Special Tokens Matter
Consider the sentence: "This movie was AMAZING!!!"
Without special tokens, after lowercasing: ["this", "movie", "was", "amazing", "!", "!", "!"]
With special tokens: ["xxbos", "xxmaj", "this", "movie", "was", "xxup", "amazing", "xxrep", "3", "!"]
The special-token version preserves the emphasis conveyed by capitalization and punctuation repetition, which is critical for sentiment analysis tasks. The model can learn that xxup followed by a word often indicates strong sentiment.