Principle: Fastai Fastbook Numericalization
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Feature Engineering, Text Preprocessing |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Numericalization is the process of mapping discrete token strings to unique integer indices via a fixed vocabulary, converting text sequences into numerical tensors suitable for neural network input.
Description
Neural networks operate on numerical tensors, not strings. Numericalization bridges this gap by:
- Building a vocabulary: Scanning the entire training corpus to count token frequencies, then selecting the most common tokens up to a maximum vocabulary size, filtering out tokens that appear fewer than a minimum number of times.
- Mapping tokens to integers: Assigning each vocabulary token a unique integer index. Special tokens (xxbos, xxpad, xxunk, etc.) are placed at the beginning of the vocabulary with reserved indices.
- Encoding sequences: Replacing each token in a tokenized text with its corresponding integer index. Tokens not found in the vocabulary are replaced with the xxunk (unknown) index.
The vocabulary is an ordered list where position equals index. For example, if vocab[5] = "the", then every occurrence of "the" in the corpus is replaced by the integer 5.
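The position-equals-index convention can be illustrated with a minimal sketch (toy vocabulary invented for this example, not fastai's real IMDb vocab):

```python
# Toy vocabulary: list position IS the integer index (hypothetical tokens).
vocab = ["xxunk", "xxpad", "xxbos", "a", "movie", "the"]

# Reverse mapping from token string to integer index.
token_to_index = {tok: i for i, tok in enumerate(vocab)}

# Every occurrence of "the" becomes the integer 5, because vocab[5] == "the".
tokens = ["xxbos", "the", "movie", "the"]
indices = [token_to_index[t] for t in tokens]
print(indices)  # [2, 5, 4, 5]
```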
Usage
Numericalization is required after tokenization and before any model training or data loading. Use this step when:
- Converting tokenized text into a format that can be embedded by a neural network's embedding layer.
- You need to control vocabulary size to manage model complexity and memory usage.
- Working with the ULMFiT pipeline where the vocabulary must be shared between the language model and the downstream classifier.
Critical note: The vocabulary built during language model training must be reused for classifier training. If the classifier uses a different vocabulary, the pretrained embeddings will be misaligned.
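A quick way to see the misalignment in plain Python (hypothetical two-vocabulary example, not fastai API code): encoding with one vocabulary and interpreting the indices with another silently scrambles the tokens, which is exactly what happens to pretrained embedding rows.

```python
# Two vocabularies built from different corpora (hypothetical example).
lm_vocab = ["xxunk", "xxpad", "great", "movie"]
clf_vocab = ["xxunk", "xxpad", "movie", "great"]  # Same tokens, different order!

# Encode with the language-model vocabulary...
lm_index = {t: i for i, t in enumerate(lm_vocab)}
ids = [lm_index[t] for t in ["great", "movie"]]

# ...then (incorrectly) look the ids up in the classifier vocabulary:
decoded = [clf_vocab[i] for i in ids]
print(decoded)  # ['movie', 'great'] -- the rows no longer line up
```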
Theoretical Basis
Vocabulary Construction
```
FUNCTION build_vocabulary(tokenized_corpus, min_freq, max_vocab):
    # Step 1: Count all token frequencies
    freq = Counter()
    FOR EACH document IN tokenized_corpus:
        FOR EACH token IN document:
            freq[token] += 1

    # Step 2: Filter by minimum frequency
    candidates = {t: c FOR (t, c) IN freq.items() IF c >= min_freq}

    # Step 3: Sort by frequency (descending) and truncate
    sorted_tokens = sort_by_frequency(candidates, descending=True)
    vocab_tokens = sorted_tokens[:max_vocab]

    # Step 4: Prepend special tokens
    special = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep',
               'xxwrep', 'xxup', 'xxmaj']
    vocab = special + vocab_tokens

    # Step 5: Build reverse mapping
    token_to_index = {token: idx FOR idx, token IN enumerate(vocab)}
    RETURN vocab, token_to_index
```
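The pseudocode translates to runnable Python along these lines (a sketch of the algorithm, not fastai's actual `Numericalize` transform; the corpus below is toy data):

```python
from collections import Counter

SPECIAL = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep',
           'xxwrep', 'xxup', 'xxmaj']

def build_vocabulary(tokenized_corpus, min_freq=3, max_vocab=60000):
    # Step 1: count token frequencies across the whole corpus.
    freq = Counter(tok for doc in tokenized_corpus for tok in doc)
    # Steps 2-3: keep tokens meeting the threshold, most frequent first.
    kept = [t for t, c in freq.most_common() if c >= min_freq]
    # Step 4: truncate to max_vocab and prepend the special tokens.
    vocab = SPECIAL + kept[:max_vocab]
    # Step 5: reverse mapping from token to index.
    token_to_index = {tok: i for i, tok in enumerate(vocab)}
    return vocab, token_to_index

corpus = [["the", "movie", "was", "good"],
          ["the", "movie", "was", "bad"],
          ["the", "plot"]]
vocab, t2i = build_vocabulary(corpus, min_freq=2)
print(vocab[len(SPECIAL):])  # ['the', 'movie', 'was'] -- only tokens with count >= 2
```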
Numericalization Encoding
```
FUNCTION numericalize(tokens, token_to_index, unk_index=0):
    indices = []
    FOR EACH token IN tokens:
        IF token IN token_to_index:
            indices.append(token_to_index[token])
        ELSE:
            indices.append(unk_index)  # Map unknown tokens to xxunk
    RETURN tensor(indices)
```
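The encoding step is equally short in Python (a sketch; a plain list stands in for the tensor that a framework implementation would return):

```python
def numericalize(tokens, token_to_index, unk_index=0):
    # Out-of-vocabulary tokens fall back to the xxunk index (0 by convention).
    return [token_to_index.get(tok, unk_index) for tok in tokens]

# Toy mapping for illustration.
token_to_index = {"xxunk": 0, "xxpad": 1, "xxbos": 2, "the": 3, "movie": 4}
print(numericalize(["xxbos", "the", "zzz", "movie"], token_to_index))
# [2, 3, 0, 4] -- "zzz" is not in the vocabulary, so it maps to xxunk
```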
Key Design Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| min_freq | 3 | Tokens appearing fewer than 3 times are excluded and mapped to xxunk at inference time. This reduces noise from rare misspellings and proper nouns. |
| max_vocab | 60,000 | Caps vocabulary size at 60,000 tokens. The embedding matrix has shape (max_vocab, embedding_dim), so this directly controls model size. |
Impact on Model Architecture
The vocabulary size V determines:
- The embedding matrix dimensions: V x d where d is the embedding dimension (typically 400 for AWD-LSTM).
- The output projection layer dimensions for language models: d x V.
- The memory footprint: Each embedding vector requires d x 4 bytes (float32), so the total embedding memory is V x d x 4 bytes.
For V = 60,000 and d = 400: embedding memory = 60,000 x 400 x 4 = 96 MB.
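The arithmetic can be checked directly (96 MB here is decimal megabytes; in binary units 96,000,000 bytes is about 91.6 MiB):

```python
V, d = 60_000, 400      # vocabulary size and embedding dimension
bytes_per_float32 = 4

embedding_params = V * d                          # embedding matrix shape (V, d)
embedding_bytes = embedding_params * bytes_per_float32
print(embedding_params)                # 24000000 weights
print(embedding_bytes / 1_000_000)     # 96.0 MB (decimal)
```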