Principle: Fastai Fastbook Numericalization
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Feature Engineering, Text Preprocessing |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Numericalization is the process of mapping discrete token strings to unique integer indices via a fixed vocabulary, converting text sequences into numerical tensors suitable for neural network input.
Description
Neural networks operate on numerical tensors, not strings. Numericalization bridges this gap by:
- Building a vocabulary: Scanning the entire training corpus to count token frequencies, then selecting the most common tokens up to a maximum vocabulary size, filtering out tokens that appear fewer than a minimum number of times.
- Mapping tokens to integers: Assigning each vocabulary token a unique integer index. Special tokens (xxbos, xxpad, xxunk, etc.) are placed at the beginning of the vocabulary with reserved indices.
- Encoding sequences: Replacing each token in a tokenized text with its corresponding integer index. Tokens not found in the vocabulary are replaced with the xxunk (unknown) index.
The vocabulary is an ordered list where position equals index. For example, if vocab[5] = "the", then every occurrence of "the" in the corpus is replaced by the integer 5.
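The position-equals-index convention can be illustrated with a minimal sketch (toy vocabulary invented for this example, not fastai's real IMDb vocab):

```python
# Toy vocabulary: list position IS the integer index (hypothetical tokens).
vocab = ["xxunk", "xxpad", "xxbos", "a", "movie", "the"]

# Reverse mapping from token string to integer index.
token_to_index = {tok: i for i, tok in enumerate(vocab)}

# Every occurrence of "the" becomes the integer 5, because vocab[5] == "the".
tokens = ["xxbos", "the", "movie", "the"]
indices = [token_to_index[t] for t in tokens]
print(indices)  # [2, 5, 4, 5]
```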
Usage
Numericalization is required after tokenization and before any model training or data loading. Use this step when:
- Converting tokenized text into a format that can be embedded by a neural network's embedding layer.
- You need to control vocabulary size to manage model complexity and memory usage.
- Working with the ULMFiT pipeline where the vocabulary must be shared between the language model and the downstream classifier.
Critical note: The vocabulary built during language model training must be reused for classifier training. If the classifier uses a different vocabulary, the pretrained embeddings will be misaligned.
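A quick way to see the misalignment in plain Python (hypothetical two-vocabulary example, not fastai API code): encoding with one vocabulary and interpreting the indices with another silently scrambles the tokens, which is exactly what happens to pretrained embedding rows.

```python
# Two vocabularies built from different corpora (hypothetical example).
lm_vocab = ["xxunk", "xxpad", "great", "movie"]
clf_vocab = ["xxunk", "xxpad", "movie", "great"]  # Same tokens, different order!

# Encode with the language-model vocabulary...
lm_index = {t: i for i, t in enumerate(lm_vocab)}
ids = [lm_index[t] for t in ["great", "movie"]]

# ...then (incorrectly) look the ids up in the classifier vocabulary:
decoded = [clf_vocab[i] for i in ids]
print(decoded)  # ['movie', 'great'] -- the rows no longer line up
```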
Theoretical Basis
Vocabulary Construction
```
FUNCTION build_vocabulary(tokenized_corpus, min_freq, max_vocab):
    # Step 1: Count all token frequencies
    freq = Counter()
    FOR EACH document IN tokenized_corpus:
        FOR EACH token IN document:
            freq[token] += 1

    # Step 2: Filter by minimum frequency
    candidates = {t: c FOR (t, c) IN freq.items() IF c >= min_freq}

    # Step 3: Sort by frequency (descending) and truncate
    sorted_tokens = sort_by_frequency(candidates, descending=True)
    vocab_tokens = sorted_tokens[:max_vocab]

    # Step 4: Prepend special tokens
    special = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep',
               'xxwrep', 'xxup', 'xxmaj']
    vocab = special + vocab_tokens

    # Step 5: Build reverse mapping
    token_to_index = {token: idx FOR idx, token IN enumerate(vocab)}
    RETURN vocab, token_to_index
```
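The pseudocode translates to runnable Python along these lines (a sketch of the algorithm, not fastai's actual `Numericalize` transform; the corpus below is toy data):

```python
from collections import Counter

SPECIAL = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep',
           'xxwrep', 'xxup', 'xxmaj']

def build_vocabulary(tokenized_corpus, min_freq=3, max_vocab=60000):
    # Step 1: count token frequencies across the whole corpus.
    freq = Counter(tok for doc in tokenized_corpus for tok in doc)
    # Steps 2-3: keep tokens meeting the threshold, most frequent first.
    kept = [t for t, c in freq.most_common() if c >= min_freq]
    # Step 4: truncate to max_vocab and prepend the special tokens.
    vocab = SPECIAL + kept[:max_vocab]
    # Step 5: reverse mapping from token to index.
    token_to_index = {tok: i for i, tok in enumerate(vocab)}
    return vocab, token_to_index

corpus = [["the", "movie", "was", "good"],
          ["the", "movie", "was", "bad"],
          ["the", "plot"]]
vocab, t2i = build_vocabulary(corpus, min_freq=2)
print(vocab[len(SPECIAL):])  # ['the', 'movie', 'was'] -- only tokens with count >= 2
```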
Numericalization Encoding
```
FUNCTION numericalize(tokens, token_to_index, unk_index=0):
    indices = []
    FOR EACH token IN tokens:
        IF token IN token_to_index:
            indices.append(token_to_index[token])
        ELSE:
            indices.append(unk_index)  # Map unknown tokens to xxunk
    RETURN tensor(indices)
```
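The encoding step is equally short in Python (a sketch; a plain list stands in for the tensor that a framework implementation would return):

```python
def numericalize(tokens, token_to_index, unk_index=0):
    # Out-of-vocabulary tokens fall back to the xxunk index (0 by convention).
    return [token_to_index.get(tok, unk_index) for tok in tokens]

# Toy mapping for illustration.
token_to_index = {"xxunk": 0, "xxpad": 1, "xxbos": 2, "the": 3, "movie": 4}
print(numericalize(["xxbos", "the", "zzz", "movie"], token_to_index))
# [2, 3, 0, 4] -- "zzz" is not in the vocabulary, so it maps to xxunk
```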
Key Design Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| min_freq | 3 | Tokens appearing fewer than 3 times are excluded and mapped to xxunk at inference time. This reduces noise from rare misspellings and proper nouns. |
| max_vocab | 60,000 | Caps vocabulary size at 60,000 tokens. The embedding matrix has shape (max_vocab, embedding_dim), so this directly controls model size. |
Impact on Model Architecture
The vocabulary size V determines:
- The embedding matrix dimensions: V x d where d is the embedding dimension (typically 400 for AWD-LSTM).
- The output projection layer dimensions for language models: d x V.
- The memory footprint: Each embedding vector requires d x 4 bytes (float32), so the total embedding memory is V x d x 4 bytes.
For V = 60,000 and d = 400: embedding memory = 60,000 x 400 x 4 = 96 MB.
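The arithmetic can be checked directly (96 MB here is decimal megabytes; in binary units 96,000,000 bytes is about 91.6 MiB):

```python
V, d = 60_000, 400      # vocabulary size and embedding dimension
bytes_per_float32 = 4

embedding_params = V * d                          # embedding matrix shape (V, d)
embedding_bytes = embedding_params * bytes_per_float32
print(embedding_params)                # 24000000 weights
print(embedding_bytes / 1_000_000)     # 96.0 MB (decimal)
```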