Implementation:Fastai Fastbook Numericalize
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Feature Engineering |
| Last Updated | 2026-02-09 17:00 GMT |
Overview
Concrete tool for building a token vocabulary and converting token sequences into integer tensors, provided by the fastai library.
Description
The Numericalize class is a fastai Transform that operates in three phases:
- Setup phase (Numericalize.setup): Scans a collection of tokenized documents to build a vocabulary. It counts token frequencies, filters by min_freq, truncates to max_vocab, and prepends special tokens. The resulting vocabulary is stored as the .vocab attribute.
- Encode phase (__call__ / encodes): Converts a list of token strings into a TensorText of integer indices by looking up each token in the vocabulary. Unknown tokens map to index 0 (xxunk).
- Decode phase (decodes): Converts integer indices back to token strings for human-readable inspection.
Numericalize is typically used as part of a Pipeline within the fastai DataBlock API, but can also be used standalone for manual inspection and debugging.
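The logic described above can be sketched in plain Python. This is a simplified illustration of the algorithm, not fastai's actual implementation; `build_vocab`, `encode`, and `decode` are hypothetical helpers introduced here for clarity.

```python
from collections import Counter

# fastai's default special tokens (prepended to every vocabulary)
SPECIALS = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld',
            'xxrep', 'xxwrep', 'xxup', 'xxmaj']

def build_vocab(tokenized_texts, min_freq=3, max_vocab=60000):
    """Count tokens, keep those appearing >= min_freq times,
    truncate to max_vocab, and prepend the special tokens."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    kept = [tok for tok, c in counts.most_common(max_vocab)
            if c >= min_freq and tok not in SPECIALS]
    return SPECIALS + kept

def encode(tokens, vocab):
    """Map each token to its index; unknown tokens map to index 0 (xxunk)."""
    idx = {tok: i for i, tok in enumerate(vocab)}
    return [idx.get(tok, 0) for tok in tokens]

def decode(indices, vocab):
    """Map indices back to token strings."""
    return [vocab[i] for i in indices]
```

Note the round-trip property: decoding an encoded text recovers the original tokens, except that any token absent from the vocabulary comes back as 'xxunk'.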
Usage
Use Numericalize after tokenization is complete and before constructing data loaders. It is essential to call setup() on the training data before encoding, and the same Numericalize instance (with its learned vocabulary) must be reused for validation, test, and classifier data.
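The reason one instance must be reused can be seen with a toy stand-in (plain Python, not the fastai API; `toy_vocab` is a hypothetical helper): two vocabularies built independently assign different indices to the same token, so a model trained against the first mapping would silently misread data encoded with the second.

```python
from collections import Counter

def toy_vocab(texts):
    # Frequency-ordered vocabulary with 'xxunk' at index 0,
    # ties broken alphabetically for determinism
    counts = Counter(tok for text in texts for tok in text)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ['xxunk'] + [tok for tok, _ in ordered]

train = [['great', 'movie'], ['great', 'film']]
valid = [['movie', 'was', 'dull']]

train_vocab = toy_vocab(train)  # built on training data
wrong_vocab = toy_vocab(valid)  # WRONG: rebuilt on validation data

train_idx = {t: i for i, t in enumerate(train_vocab)}
wrong_idx = {t: i for i, t in enumerate(wrong_vocab)}

# The same token receives a different integer under the two vocabularies
print(train_idx['movie'], wrong_idx['movie'])  # prints: 3 2
```

This is exactly the failure mode that reusing one Numericalize instance (or passing its vocab forward) prevents.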
Code Reference
Source Location
- Repository: fastbook
- File: translations/cn/10_nlp.md (lines 353-370)
- Library module: fastai.text.core
Signature
class Numericalize(Transform):
    "Transform that maps token strings to integer indices via a vocabulary"
    def __init__(
        self,
        vocab: list = None,          # Pre-built vocabulary; if None, one is built during setup
        min_freq: int = 3,           # Minimum token frequency to include in vocab
        max_vocab: int = 60000,      # Maximum vocabulary size
        special_toks: list = None    # Special tokens (defaults to fastai's standard specials)
    ):
        ...
    def setup(
        self,
        items: list = None  # Collection of tokenized texts to build vocab from
    ):
        ...
def encodes(self, o: list) -> TensorText:
"Convert token list to integer tensor"
...
def decodes(self, o: TensorText) -> list:
"Convert integer tensor back to token list"
...
Import
from fastai.text.all import Numericalize
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab | list | No | Pre-built vocabulary to use directly; if None, one is built during setup(). |
| min_freq | int | No | Minimum number of times a token must appear to be included in the vocabulary. Default: 3. |
| max_vocab | int | No | Maximum number of tokens in the vocabulary (excluding special tokens). Default: 60,000. |
| special_toks | list | No | List of special token strings to prepend to the vocabulary. Defaults to ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj']. |
| items | list of list of str | Yes (for setup) | Collection of tokenized texts used to build the vocabulary during the setup call. |
| o | list of str | Yes (for encodes) | A single tokenized text (list of token strings) to convert to integer indices. |
Outputs
| Name | Type | Description |
|---|---|---|
| vocab | list of str | The ordered vocabulary list. vocab[i] gives the token string for index i. Accessible as .vocab attribute after setup(). |
| encoded | TensorText | A 1-D integer tensor of token indices. Each element is the vocabulary index of the corresponding token. |
| decoded | list of str | Token strings recovered from integer indices (via decodes). |
Usage Examples
Basic Usage
from fastai.text.all import Numericalize, Tokenizer, WordTokenizer
# Tokenize some sample texts
tok = Tokenizer(WordTokenizer())
texts = [
"This movie was great and I loved it.",
"This movie was terrible and I hated it.",
"A wonderful film with great acting."
]
tokenized = [tok(t) for t in texts]  # tokenize each text individually
# Build vocabulary from tokenized texts
num = Numericalize(min_freq=1, max_vocab=60000)
num.setup(tokenized)
# Inspect the vocabulary
print(num.vocab[:15])
# Output: ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep',
# 'xxup', 'xxmaj', 'this', 'movie', 'was', 'and', 'i', 'it']
print(f"Vocabulary size: {len(num.vocab)}")
Encoding and Decoding
from fastai.text.all import Numericalize, Tokenizer, WordTokenizer
tok = Tokenizer(WordTokenizer())
texts = ["The movie was excellent.", "The movie was terrible."]
tokenized = [tok(t) for t in texts]  # tokenize each text individually
num = Numericalize(min_freq=1)
num.setup(tokenized)
# Encode a tokenized text to integers
encoded = num(tokenized[0])
print(encoded)
# Output: TensorText([2, 8, 9, 10, 11, 12, 13])
# Decode back to tokens
decoded = num.decode(encoded)
print(decoded)
# Output: ['xxbos', 'xxmaj', 'the', 'movie', 'was', 'excellent', '.']
Controlling Vocabulary Size
from fastai.text.all import Numericalize
# With min_freq=3, only tokens appearing 3+ times are included.
# all_tokenized_texts is assumed to be a large collection of tokenized documents.
num_strict = Numericalize(min_freq=3, max_vocab=60000)
num_strict.setup(all_tokenized_texts)
print(f"Vocab size with min_freq=3: {len(num_strict.vocab)}")
# Rare tokens are mapped to xxunk (index 0)
rare_token_idx = num_strict(["xxbos", "xxmaj", "supercalifragilistic"])
print(rare_token_idx)
# Output: TensorText([2, 8, 0]) # 0 = xxunk for the rare word
Sharing Vocabulary Between LM and Classifier
from fastai.text.all import DataBlock, TextBlock, CategoryBlock
# After building language-model DataLoaders, e.g.:
# dls_lm = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True), ...).dataloaders(path)
# Access the vocabulary learned for the language model
lm_vocab = dls_lm.vocab
# Reuse it for the classifier DataLoaders so the
# token-to-index mapping stays consistent
dls_clas = DataBlock(
blocks=(TextBlock.from_folder(path, vocab=lm_vocab), CategoryBlock),
...
).dataloaders(path)
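The effect of passing the vocabulary through can be checked with a minimal stand-in (plain Python; `encode` is a hypothetical helper mirroring the lookup Numericalize performs): any two pipelines given the same vocab assign identical indices to identical tokens.

```python
def encode(tokens, vocab):
    # Same lookup rule as Numericalize: unknown tokens map to index 0 ('xxunk')
    idx = {t: i for i, t in enumerate(vocab)}
    return [idx.get(t, 0) for t in tokens]

shared_vocab = ['xxunk', 'xxpad', 'xxbos', 'the', 'movie', 'was', 'great']

# Language-model data and classifier data encoded with the SAME vocab
lm_batch   = encode(['xxbos', 'the', 'movie', 'was', 'great'], shared_vocab)
clas_batch = encode(['xxbos', 'the', 'movie'], shared_vocab)

# Identical tokens receive identical indices in both pipelines
print(lm_batch)    # prints: [2, 3, 4, 5, 6]
print(clas_batch)  # prints: [2, 3, 4]
```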