
Principle:Fastai Fastbook Tokenization

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Text Preprocessing, Computational Linguistics
Last Updated 2026-02-09 17:00 GMT

Overview

Tokenization is the process of splitting raw text into a sequence of discrete units (tokens) that serve as the atomic input elements for language models and text classifiers.

Description

Tokenization bridges the gap between raw human-readable text and the numerical representations required by neural networks. It operates in two conceptual stages:

  1. Base tokenization: Splitting the raw character stream into word-level or subword-level units using linguistic rules, statistical models, or both.
  2. Token augmentation: Inserting special tokens that encode document boundaries, capitalization patterns, repetition, and other meta-information that would otherwise be lost during the lowercasing and normalization steps.

There are three major families of tokenization:

  • Word-level tokenization: Splits text on whitespace and punctuation boundaries. Simple and interpretable, but produces large vocabularies and cannot handle out-of-vocabulary (OOV) words gracefully.
  • Subword tokenization: Algorithms like Byte Pair Encoding (BPE), WordPiece, and SentencePiece split rare words into frequently occurring subword units. This balances vocabulary size against sequence length.
  • Character-level tokenization: Each character becomes a token. Produces the smallest vocabulary but the longest sequences, making training slow and long-range dependencies harder to learn.
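To make the tradeoffs concrete, here is a minimal sketch (not fastai's implementation): character-level tokenization needs no learned vocabulary at all, while subword schemes like BPE learn merge rules from corpus statistics. The `bpe_merge_step` below performs a single toy BPE merge over a corpus represented as lists of symbols:

```python
from collections import Counter

def char_tokens(text):
    # Character-level tokenization: every character is a token
    return list(text)

def bpe_merge_step(corpus):
    # One step of toy Byte Pair Encoding: count adjacent symbol
    # pairs across the corpus and merge the most frequent pair.
    pairs = Counter()
    for word in corpus:
        pairs.update(zip(word, word[1:]))
    if not pairs:
        return corpus
    a, b = max(pairs, key=pairs.get)
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                out.append(a + b)  # merge the pair into one subword unit
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = [char_tokens(w) for w in ["low", "lower", "lowest"]]
corpus = bpe_merge_step(corpus)  # the most frequent pair ("l", "o") is merged
```

Repeating the merge step grows a vocabulary of progressively longer subwords, which is how real BPE tokenizers balance vocabulary size against sequence length.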

The ULMFiT approach (Howard & Ruder, 2018) uses word-level tokenization via spaCy as the base tokenizer, augmented with a set of special tokens that preserve information typically lost during preprocessing.

Usage

Tokenization is the step that sits between raw text input and numericalization in every NLP pipeline. Use word-level tokenization when:

  • Working with the ULMFiT / AWD-LSTM pipeline where the pretrained model expects word tokens.
  • The target language has reliable word boundary detection (e.g., English, most European languages).
  • You need interpretable tokens for debugging and analysis.

Theoretical Basis

Base Tokenization Algorithm

Word tokenization relies on a linguistic model (typically spaCy) that applies the following rules:

FUNCTION word_tokenize(text):
    doc = spacy_model(text)
    tokens = []
    FOR EACH token IN doc:
        tokens.append(token.text)
    RETURN tokens
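A runnable stand-in for the pseudocode above, using a regular expression in place of spaCy's linguistic rules (an assumption for illustration; the actual base tokenizer is spaCy):

```python
import re

def word_tokenize(text):
    # Regex stand-in for a spaCy-style word tokenizer: runs of word
    # characters become one token, each punctuation mark its own token.
    return re.findall(r"\w+|[^\w\s]", text)
```

Note the approximation: `word_tokenize("Don't stop!")` yields `["Don", "'", "t", "stop", "!"]` here, whereas spaCy's exception rules would produce `["Do", "n't", "stop", "!"]`.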

Special Token Rules

After base tokenization, the fastai tokenizer applies a series of rules that insert special tokens. These rules preserve information that would otherwise be destroyed by lowercasing:

  • xxbos (beginning of stream): Inserted at the start of every document. Signals to the model that a new text begins.
  • xxmaj (next word is capitalized): Inserted before any word that starts with an uppercase letter; the word itself is then lowercased.
  • xxup (next word is all uppercase): Inserted before any word that is entirely uppercase (e.g., "AMAZING"); the word is then lowercased.
  • xxrep (character repetition): Inserted when a single character is repeated three or more times (e.g., "!!!!!" becomes "xxrep 5 !").
  • xxwrep (word repetition): Inserted when a word is repeated three or more times consecutively (e.g., "very very very" becomes "xxwrep 3 very").
  • xxunk (unknown token): Used during numericalization for tokens not found in the vocabulary.
  • xxpad (padding): Used to pad sequences to equal length within a batch.
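The capitalization and word-repetition rules can be sketched as token-list transforms (a simplification: fastai applies some rules to the raw text and handles additional edge cases):

```python
def handle_capitalization(tokens):
    # xxup before an all-uppercase word, xxmaj before a word whose
    # first letter is uppercase; the word itself is lowercased.
    out = []
    for t in tokens:
        if len(t) > 1 and t.isupper():
            out += ["xxup", t.lower()]
        elif t[:1].isupper():
            out += ["xxmaj", t.lower()]
        else:
            out.append(t)
    return out

def handle_word_reps(tokens):
    # Collapse three or more consecutive copies of a word
    # into "xxwrep N word".
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and tokens[j] == tokens[i]:
            j += 1
        n = j - i
        out += (["xxwrep", str(n), tokens[i]] if n >= 3 else tokens[i:j])
        i = j
    return out
```

For example, `handle_capitalization(["This", "was", "AMAZING"])` returns `["xxmaj", "this", "was", "xxup", "amazing"]`, and `handle_word_reps(["very", "very", "very", "good"])` returns `["xxwrep", "3", "very", "good"]`.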

Tokenization Pipeline

The full tokenization pipeline can be expressed as:

FUNCTION full_tokenize(text):
    # Step 1: Apply pre-rules (cleanup)
    text = fix_html_entities(text)
    text = replace_special_characters(text)

    # Step 2: Base tokenization
    tokens = word_tokenize(text)

    # Step 3: Apply post-rules (special tokens)
    tokens = insert_bos(tokens)           # Add xxbos at start
    tokens = handle_capitalization(tokens)  # xxmaj, xxup
    tokens = handle_repetitions(tokens)     # xxrep, xxwrep
    tokens = lowercase_all(tokens)          # Lowercase everything

    RETURN tokens
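Putting the stages together, the pipeline above can be sketched end to end in Python (a regex tokenizer and simplified rules stand in for spaCy and fastai's full rule set; the word-repetition rule is omitted for brevity):

```python
import re

def full_tokenize(text):
    # Pre-rule: collapse runs of 3+ identical characters into "xxrep N c"
    text = re.sub(r"(\S)\1{2,}",
                  lambda m: f" xxrep {len(m.group(0))} {m.group(1)} ",
                  text)
    # Base tokenization: regex stand-in for spaCy
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Post-rules: beginning-of-stream marker, then capitalization
    # markers with lowercasing
    out = ["xxbos"]
    for t in tokens:
        if len(t) > 1 and t.isupper():
            out += ["xxup", t.lower()]
        elif t[:1].isupper():
            out += ["xxmaj", t.lower()]
        else:
            out.append(t)
    return out
```

Running `full_tokenize("This movie was AMAZING!!!")` returns `["xxbos", "xxmaj", "this", "movie", "was", "xxup", "amazing", "xxrep", "3", "!"]`, matching the worked example in the next section.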

Why Special Tokens Matter

Consider the sentence: "This movie was AMAZING!!!"

Without special tokens, after lowercasing: ["this", "movie", "was", "amazing", "!", "!", "!"]

With special tokens: ["xxbos", "xxmaj", "this", "movie", "was", "xxup", "amazing", "xxrep", "3", "!"]

The special-token version preserves the emphasis conveyed by capitalization and punctuation repetition, which is critical for sentiment analysis tasks. The model can learn that xxup followed by a word often indicates strong sentiment.

Related Pages

Implemented By
