
Implementation:Fastai Fastbook Numericalize

From Leeroopedia


Knowledge Sources
Domains Natural Language Processing, Feature Engineering
Last Updated 2026-02-09 17:00 GMT

Overview

Concrete tool for building a token vocabulary and converting token sequences into integer tensors, provided by the fastai library.

Description

The Numericalize class is a fastai Transform that handles two phases:

  • Setup phase (Numericalize.setup): Scans a collection of tokenized documents to build a vocabulary. It counts token frequencies, filters by min_freq, truncates to max_vocab, and prepends special tokens. The resulting vocabulary is stored as the .vocab attribute.
  • Encode phase (__call__ / encodes): Converts a list of token strings into a TensorText of integer indices by looking up each token in the vocabulary. Unknown tokens map to index 0 (xxunk).
  • Decode phase (decodes): Converts integer indices back to token strings for human-readable inspection.

Numericalize is typically used as part of a Pipeline within the fastai DataBlock API, but can also be used standalone for manual inspection and debugging.
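The setup logic described above (count frequencies, filter by min_freq, truncate to max_vocab, prepend special tokens) can be sketched in plain Python. This is a simplified illustration, not fastai's actual implementation, and `make_vocab_sketch` is a hypothetical helper:

```python
from collections import Counter

def make_vocab_sketch(tokenized_texts, min_freq=3, max_vocab=60000,
                      special=("xxunk", "xxpad", "xxbos", "xxeos")):
    # Count token frequencies across all documents
    counts = Counter(tok for doc in tokenized_texts for tok in doc)
    # Keep tokens meeting min_freq, most frequent first, truncated to max_vocab
    kept = [tok for tok, c in counts.most_common(max_vocab) if c >= min_freq]
    # Prepend special tokens, dropping any duplicates already in `kept`
    return list(special) + [t for t in kept if t not in special]

docs = [["xxbos", "the", "cat", "sat"], ["xxbos", "the", "dog", "ran"]]
vocab = make_vocab_sketch(docs, min_freq=2)
print(vocab)  # ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'the']
```

Only "xxbos" and "the" appear at least twice, and "xxbos" is already a special token, so the resulting vocabulary is the special tokens followed by "the".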

Usage

Use Numericalize after tokenization is complete and before constructing data loaders. It is essential to call setup() on the training data before encoding, and the same Numericalize instance (with its learned vocabulary) must be reused for validation, test, and classifier data.
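Why reusing the training vocabulary matters can be illustrated with a minimal plain-Python stand-in for the lookup step (`numericalize_sketch` is a hypothetical helper, not the fastai class):

```python
def numericalize_sketch(tokens, vocab):
    # Look up each token; unknown tokens fall back to index 0 ("xxunk")
    tok2idx = {t: i for i, t in enumerate(vocab)}
    return [tok2idx.get(t, 0) for t in tokens]

# Vocabulary built from the *training* split only
vocab = ["xxunk", "xxpad", "xxbos", "the", "movie", "was", "great"]

train_ids = numericalize_sketch(["xxbos", "the", "movie", "was", "great"], vocab)
# A validation-only word ("awful") is out-of-vocabulary, so it maps to 0
valid_ids = numericalize_sketch(["xxbos", "the", "movie", "was", "awful"], vocab)
print(train_ids)  # [2, 3, 4, 5, 6]
print(valid_ids)  # [2, 3, 4, 5, 0]
```

If the validation set were numericalized with a freshly built vocabulary instead, identical tokens could receive different indices and the model's embeddings would no longer line up.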

Code Reference

Source Location

  • Repository: fastbook
  • File: translations/cn/10_nlp.md (lines 353-370)
  • Library module: fastai.text.core

Signature

class Numericalize(Transform):
    "Transform that maps token strings to integer indices via a vocabulary"
    def __init__(
        self,
        vocab: list = None,         # Existing vocabulary to reuse (built during setup if None)
        min_freq: int = 3,          # Minimum token frequency to include in vocab
        max_vocab: int = 60000,     # Maximum vocabulary size
        special_toks: list = None   # Special tokens (defaults to fastai's defaults.text_spec_tok)
    ):
        ...

    def setup(
        self,
        items: list = None       # Collection of tokenized texts to build vocab from
    ):
        ...

    def encodes(self, o: list) -> TensorText:
        "Convert token list to integer tensor"
        ...

    def decodes(self, o: TensorText) -> list:
        "Convert integer tensor back to token list"
        ...

Import

from fastai.text.all import Numericalize

I/O Contract

Inputs

Name Type Required Description
vocab list No An existing vocabulary to reuse. If None, the vocabulary is built from items during setup. Default: None.
min_freq int No Minimum number of times a token must appear to be included in the vocabulary. Default: 3.
max_vocab int No Maximum number of tokens in the vocabulary (excluding special tokens). Default: 60,000.
special_toks list No List of special token strings to prepend to the vocabulary. Defaults to ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep', 'xxup', 'xxmaj'].
items list of list of str Yes (for setup) Collection of tokenized texts used to build the vocabulary during the setup call.
o list of str Yes (for encodes) A single tokenized text (list of token strings) to convert to integer indices.

Outputs

Name Type Description
vocab list of str The ordered vocabulary list. vocab[i] gives the token string for index i. Accessible as .vocab attribute after setup().
encoded TensorText A 1-D integer tensor of token indices. Each element is the vocabulary index of the corresponding token.
decoded list of str Token strings recovered from integer indices (via decodes).
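The tables above imply a round-trip contract: decoding an encoded sequence recovers every in-vocabulary token, while out-of-vocabulary tokens come back as 'xxunk'. A plain-Python sketch of that contract (hypothetical helpers, not the fastai implementation):

```python
vocab = ["xxunk", "xxpad", "xxbos", "a", "wonderful", "film"]
tok2idx = {t: i for i, t in enumerate(vocab)}

def encode(tokens):
    # OOV tokens map to index 0 ("xxunk")
    return [tok2idx.get(t, 0) for t in tokens]

def decode(indices):
    # vocab[i] gives the token string for index i
    return [vocab[i] for i in indices]

ids = encode(["xxbos", "a", "wonderful", "film", "masterpiece"])
print(ids)          # [2, 3, 4, 5, 0] ("masterpiece" is OOV)
print(decode(ids))  # ['xxbos', 'a', 'wonderful', 'film', 'xxunk']
```

Note the round trip is lossy for OOV tokens: "masterpiece" goes in but "xxunk" comes back, which is why vocabulary coverage matters for inspection and debugging.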

Usage Examples

Basic Usage

from fastai.text.all import Numericalize, Tokenizer, WordTokenizer

# Tokenize some sample texts
tok = Tokenizer(WordTokenizer())
texts = [
    "This movie was great and I loved it.",
    "This movie was terrible and I hated it.",
    "A wonderful film with great acting."
]
tokenized = [tok(t) for t in texts]  # tokenize each text individually

# Build vocabulary from tokenized texts
num = Numericalize(min_freq=1, max_vocab=60000)
num.setup(tokenized)

# Inspect the vocabulary
print(num.vocab[:15])
# Example output (special tokens first, then corpus tokens by frequency):
# ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld', 'xxrep', 'xxwrep',
#  'xxup', 'xxmaj', 'this', 'movie', 'was', 'and', 'i', 'it']

print(f"Vocabulary size: {len(num.vocab)}")

Encoding and Decoding

from fastai.text.all import Numericalize, Tokenizer, WordTokenizer

tok = Tokenizer(WordTokenizer())
texts = ["The movie was excellent.", "The movie was terrible."]
tokenized = [tok(t) for t in texts]  # tokenize each text individually

num = Numericalize(min_freq=1)
num.setup(tokenized)

# Encode a tokenized text to integers
encoded = num(tokenized[0])
print(encoded)
# Example output: TensorText([2, 8, 9, 10, 11, 12, 13])
# (exact indices depend on the vocabulary that setup built)

# Decode back to tokens
decoded = num.decode(encoded)
print(decoded)
# Example output: ['xxbos', 'xxmaj', 'the', 'movie', 'was', 'excellent', '.']

Controlling Vocabulary Size

from fastai.text.all import Numericalize

# With min_freq=3, only tokens appearing 3+ times are included
# (all_tokenized_texts: a collection of tokenized texts, as built in the examples above)
num_strict = Numericalize(min_freq=3, max_vocab=60000)
num_strict.setup(all_tokenized_texts)

print(f"Vocab size with min_freq=3: {len(num_strict.vocab)}")

# Rare tokens are mapped to xxunk (index 0)
rare_token_idx = num_strict(["xxbos", "xxmaj", "supercalifragilistic"])
print(rare_token_idx)
# Example output: TensorText([2, 8, 0])  (0 = xxunk for the rare word)

Sharing Vocabulary Between LM and Classifier

# After building language-model DataLoaders, e.g.:
# dls_lm = DataBlock(blocks=TextBlock.from_folder(path, is_lm=True), ...).dataloaders(path)

# Access the vocabulary from the language model DataLoaders
lm_vocab = dls_lm.vocab

# Reuse it for the classifier DataLoaders
# This ensures token-to-index mapping is consistent
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=lm_vocab), CategoryBlock),
    ...
).dataloaders(path)
