Implementation:Fastai Fastbook Numericalize
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Feature Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for building a token vocabulary and converting token sequences into integer tensors, provided by the fastai library.
Description
The Numericalize class is a fastai Transform that operates in three phases:
- Setup phase (Numericalize.setup): Scans a collection of tokenized documents to build a vocabulary. It counts token frequencies, filters by min_freq, truncates to max_vocab, and prepends special tokens. The resulting vocabulary is stored as the .vocab attribute.
- Encode phase (__call__ / encodes): Converts a list of token strings into a TensorText of integer indices by looking up each token in the vocabulary. Unknown tokens map to index 0 (xxunk).
- Decode phase (decodes): Converts integer indices back to token strings for human-readable inspection.
Numericalize is typically used as part of a Pipeline within the fastai DataBlock API, but can also be used standalone for manual inspection and debugging.
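The logic described above can be sketched in plain Python. This is a simplified illustration of the algorithm, not fastai's actual implementation; `build_vocab`, `encode`, and `decode` are hypothetical helpers introduced here for clarity.

```python
from collections import Counter

# fastai's default special tokens (prepended to every vocabulary)
SPECIALS = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld',
            'xxrep', 'xxwrep', 'xxup', 'xxmaj']

def build_vocab(tokenized_texts, min_freq=3, max_vocab=60000):
    """Count tokens, keep those appearing >= min_freq times,
    truncate to max_vocab, and prepend the special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = [tok for tok, c in counts.most_common(max_vocab)
            if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + kept

def encode(tokens, vocab):
    """Map each token to its index; unknown tokens map to index 0 (xxunk)."""
    idx = {tok: i for i, tok in enumerate(vocab)}
    return [idx.get(tok, 0) for tok in tokens]

def decode(indices, vocab):
    """Map indices back to token strings."""
    return [vocab[i] for i in indices]
```

Note the round-trip property: decoding an encoded text recovers the original tokens, except that any token absent from the vocabulary comes back as 'xxunk'.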
Usage
Use Numericalize after tokenization is complete and before constructing data loaders. It is essential to call setup() on the training data before encoding, and the same Numericalize instance (with its learned vocabulary) must be reused for validation, test, and classifier data.
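The reason one instance must be reused can be seen with a toy stand-in (plain Python, not the fastai API; `toy_vocab` is a hypothetical helper): two vocabularies built independently assign different indices to the same token, so a model trained against the first mapping would silently misread data encoded with the second.

```python
from collections import Counter

def toy_vocab(texts):
    # Frequency-ordered vocabulary with 'xxunk' at index 0,
    # ties broken alphabetically for determinism
    counts = Counter(tok for text in texts for tok in text)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ['xxunk'] + [tok for tok, _ in ordered]

train = [['great', 'movie'], ['great', 'film']]
valid = [['movie', 'was', 'dull']]

train_vocab = toy_vocab(train)  # built on training data
wrong_vocab = toy_vocab(valid)  # WRONG: rebuilt on validation data

train_idx = {t: i for i, t in enumerate(train_vocab)}
wrong_idx = {t: i for i, t in enumerate(wrong_vocab)}

# The same token receives a different integer under the two vocabularies
print(train_idx['movie'], wrong_idx['movie'])  # prints: 3 2
```

This is exactly the failure mode that reusing one Numericalize instance (or passing its vocab forward) prevents.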
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 353-370)
- Library module: fastai.text.core
Signature
class Numericalize(Transform):
    "Transform that maps token strings to integer indices via a vocabulary"
    def __init__(
        self,
        vocab: list = None,          # Pre-built vocabulary; if None, one is built during setup
        min_freq: int = 3,           # Minimum token frequency to include in vocab
        max_vocab: int = 60000,      # Maximum vocabulary size
        special_toks: list = None    # Special tokens (defaults to fastai's standard specials)
    ):
        ...
    def setup(
        self,
        items: list = None  # Collection of tokenized texts to build vocab from
    ):
        ...
def encodes(self, o: list) -> TensorText:
"Convert token list to integer tensor"
...
def decodes(self, o: TensorText) -> list:
"Convert integer tensor back to token list"
...
Import
from fastai.text.all import Numericalize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab | list | No | Pre-built vocabulary to use directly; if None, one is built during setup(). |
| min_freq | int | No | Minimum number of times a token must appear to be included in the vocabulary. Default: 3. |
| max_vocab | int | No | Maximum number of tokens in the vocabulary (excluding special tokens). Default: 60,000. |
| special_toks | list | No | List of special token strings to prepend to the vocabulary. Defaults to ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj']. |
| items | list of list of str | Yes (for setup) | Collection of tokenized texts used to build the vocabulary during the setup call. |
| o | list of str | Yes (for encodes) | A single tokenized text (list of token strings) to convert to integer indices. |
Outputs
| Name | Type | Description |
|---|---|---|
| vocab | list of str | The ordered vocabulary list. vocab[i] gives the token string for index i. Accessible as .vocab attribute after setup(). |
| encoded | TensorText | A 1-D integer tensor of token indices. Each element is the vocabulary index of the corresponding token. |
| decoded | list of str | Token strings recovered from integer indices (via decodes). |
Usage Examples
Basic Usage
from fastai.text.all import Numericalize, Tokenizer, WordTokenizer
# Tokenize some sample texts
tok = Tokenizer(WordTokenizer())
texts = [
"This movie was great and I loved it.",
"This movie was terrible and I hated it.",
"A wonderful film with great acting."
]
tokenized = [tok(t) for t in texts]  # tokenize each text individually
# Build vocabulary from tokenized texts
num = Numericalize(min_freq=1, max_vocab=60000)
num.setup(tokenized)
# Inspect the vocabulary
print(num.vocab[:15])
# Output: ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep',
# 'xxup', 'xxmaj', 'this', 'movie', 'was', 'and', 'i', 'it']
print(f"Vocabulary size: {len(num.vocab)}")
Encoding and Decoding
from fastai.text.all import Numericalize, Tokenizer, WordTokenizer
tok = Tokenizer(WordTokenizer())
texts = ["The movie was excellent.", "The movie was terrible."]
tokenized = [tok(t) for t in texts]  # tokenize each text individually
num = Numericalize(min_freq=1)
num.setup(tokenized)
# Encode a tokenized text to integers
encoded = num(tokenized[0])
print(encoded)
# Output: TensorText([2, 8, 9, 10, 11, 12, 13])
# Decode back to tokens
decoded = num.decode(encoded)
print(decoded)
# Output: ['xxbos', 'xxmaj', 'the', 'movie', 'was', 'excellent', '.']
Controlling Vocabulary Size
from fastai.text.all import Numericalize
# With min_freq=3, only tokens appearing 3+ times are included.
# all_tokenized_texts is assumed to be a large collection of tokenized documents.
num_strict = Numericalize(min_freq=3, max_vocab=60000)
num_strict.setup(all_tokenized_texts)
print(f"Vocab size with min_freq=3: {len(num_strict.vocab)}")
# Rare tokens are mapped to xxunk (index 0)
rare_token_idx = num_strict(["xxbos", "xxmaj", "supercalifragilistic"])
print(rare_token_idx)
# Output: TensorText([2, 8, 0]) # 0 = xxunk for the rare word
Sharing Vocabulary Between LM and Classifier
from fastai.text.all import DataBlock, TextBlock, CategoryBlock
# After building language-model DataLoaders, e.g.:
# dls_lm = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True), ...).dataloaders(path)
# Access the vocabulary learned for the language model
lm_vocab = dls_lm.vocab
# Reuse it for the classifier DataLoaders so the
# token-to-index mapping stays consistent
dls_clas = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=lm_vocab), CategoryBlock),
...
).dataloaders(path)
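The effect of passing the vocabulary through can be checked with a minimal stand-in (plain Python; `encode` is a hypothetical helper mirroring the lookup Numericalize performs): any two pipelines given the same vocab assign identical indices to identical tokens.

```python
def encode(tokens, vocab):
    # Same lookup rule as Numericalize: unknown tokens map to index 0 ('xxunk')
    idx = {t: i for i, t in enumerate(vocab)}
    return [idx.get(t, 0) for t in tokens]

shared_vocab = ['xxunk', 'xxpad', 'xxbos', 'the', 'movie', 'was', 'great']

# Language-model data and classifier data encoded with the SAME vocab
lm_batch   = encode(['xxbos', 'the', 'movie', 'was', 'great'], shared_vocab)
clas_batch = encode(['xxbos', 'the', 'movie'], shared_vocab)

# Identical tokens receive identical indices in both pipelines
print(lm_batch)    # prints: [2, 3, 4, 5, 6]
print(clas_batch)  # prints: [2, 3, 4]
```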