Principle:LaurentMazare Tch rs Vocabulary Management
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Data Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Vocabulary management maintains a bidirectional mapping between words and integer indices, handles special tokens for sequence boundaries, and supports frequency-based vocabulary pruning.
Description
A vocabulary is a fundamental data structure in natural language processing that bridges the gap between raw text and the numerical representations required by neural networks. Effective vocabulary management involves:
- Word-to-index mapping: Each unique word (or token) in the training corpus is assigned a unique integer index. This mapping must be deterministic and consistent across training and inference. The mapping is typically stored as a hash table for lookup during tokenization.
- Index-to-word mapping: The inverse mapping allows converting model outputs (integer indices or probability distributions over indices) back into human-readable text. This is essential for decoding model predictions during inference and evaluation.
- Special tokens: The vocabulary includes reserved tokens with specific semantic roles:
  - SOS (Start of Sequence): Typically index 0. Fed as the first input to the decoder, signaling the start of generation.
  - EOS (End of Sequence): Typically index 1. The decoder is trained to output this token when it has finished generating the sequence. During inference, generation stops when EOS is produced.
  - UNK (Unknown): Maps out-of-vocabulary words encountered during inference to a known index.
  - PAD (Padding): Used to fill sequences to uniform length within a batch.
- Word frequency tracking: During vocabulary construction, the frequency of each word is recorded. This enables vocabulary pruning: words appearing fewer than a threshold number of times are excluded from the vocabulary and mapped to UNK. Pruning reduces the embedding table size and can improve generalization, because the model no longer fits separate parameters to words it has seen only a handful of times.
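The components listed above can be sketched in a few dozen lines of Rust using only the standard library. This is a minimal illustration, not the tch-rs API: the `Vocab` type, its method names, the token spellings (`<sos>`, `<eos>`, `<unk>`, `<pad>`), and the choice of indices 2 and 3 for UNK and PAD are all assumptions made for the example. Only SOS = 0 and EOS = 1 come from the text.

```rust
use std::collections::HashMap;

// SOS = 0 and EOS = 1 follow the text; UNK = 2 and PAD = 3 are assumptions.
pub const SOS: usize = 0;
pub const EOS: usize = 1;
pub const UNK: usize = 2;
pub const PAD: usize = 3;

/// Hypothetical bidirectional vocabulary (illustrative, not a library type).
pub struct Vocab {
    word_to_index: HashMap<String, usize>,
    index_to_word: Vec<String>,
}

impl Vocab {
    /// Creates a vocabulary that contains only the four special tokens.
    pub fn new() -> Self {
        let mut vocab = Vocab {
            word_to_index: HashMap::new(),
            index_to_word: Vec::new(),
        };
        // Insertion order fixes the reserved indices 0, 1, 2, 3.
        for token in ["<sos>", "<eos>", "<unk>", "<pad>"] {
            vocab.add_word(token);
        }
        vocab
    }

    /// Returns the index of `word`, assigning the next free index if it is new.
    pub fn add_word(&mut self, word: &str) -> usize {
        if let Some(&index) = self.word_to_index.get(word) {
            return index;
        }
        let index = self.index_to_word.len();
        self.word_to_index.insert(word.to_string(), index);
        self.index_to_word.push(word.to_string());
        index
    }

    /// Encodes a tokenized sentence: unseen words become UNK, and EOS is appended.
    pub fn encode(&self, words: &[&str]) -> Vec<usize> {
        let mut indices: Vec<usize> = words
            .iter()
            .map(|w| self.word_to_index.get(*w).copied().unwrap_or(UNK))
            .collect();
        indices.push(EOS);
        indices
    }

    /// Decodes indices back to words, stopping at the first EOS.
    pub fn decode(&self, indices: &[usize]) -> Vec<&str> {
        indices
            .iter()
            .take_while(|i| **i != EOS)
            .map(|i| self.index_to_word.get(*i).map(String::as_str).unwrap_or("<unk>"))
            .collect()
    }

    /// Number of entries, including the special tokens.
    pub fn len(&self) -> usize {
        self.index_to_word.len()
    }
}
```

Storing the inverse mapping in a `Vec` works because indices are assigned contiguously from 0, so the position in the vector is the index. A hash map in the other direction keeps word lookup fast during tokenization.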
Usage
Vocabulary management is required in any NLP system that processes discrete text tokens, including machine translation, text classification, language modeling, and question answering. It is typically one of the first steps in any NLP data pipeline and must be performed consistently across all stages of model development.
Theoretical Basis
Formal Definition:
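A vocabulary is a bijective mapping from a finite set of tokens $V$ to a contiguous range of integer indices:
$$\text{index}: V \to \{0, 1, \ldots, |V| - 1\}$$
with special tokens occupying reserved indices:
$$\text{index}(\text{SOS}) = 0, \quad \text{index}(\text{EOS}) = 1$$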
Vocabulary Construction:
Given a corpus of tokenized sentences:
INITIALIZE:
    word_count    := empty map (missing words count as 0)
    word_to_index := {SOS: 0, EOS: 1}
    index_to_word := {0: SOS, 1: EOS}
    next_index    := 2
BUILD(corpus C):
    for each sentence s in C:
        for each word w in s:
            word_count[w] := word_count[w] + 1
            if w not in word_to_index:
                word_to_index[w] := next_index
                index_to_word[next_index] := w
                next_index := next_index + 1
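The construction procedure translates fairly directly into Rust. The sketch below is one possible rendering, assuming the corpus is already tokenized into string slices. The `Vocab` struct and `build_vocab` function are hypothetical names, and the literal strings `"SOS"` and `"EOS"` stand in for whatever spelling a real pipeline reserves for those tokens.

```rust
use std::collections::HashMap;

/// Hypothetical container mirroring the three maps in the pseudocode.
pub struct Vocab {
    pub word_count: HashMap<String, usize>,
    pub word_to_index: HashMap<String, usize>,
    pub index_to_word: Vec<String>,
}

/// Builds a vocabulary from tokenized sentences, reserving 0 for SOS and 1 for EOS.
pub fn build_vocab(corpus: &[Vec<&str>]) -> Vocab {
    let mut vocab = Vocab {
        word_count: HashMap::new(),
        word_to_index: HashMap::from([("SOS".to_string(), 0), ("EOS".to_string(), 1)]),
        index_to_word: vec!["SOS".to_string(), "EOS".to_string()],
    };
    for sentence in corpus {
        for word in sentence.iter().copied() {
            // A word seen for the first time starts at 0 and is then incremented.
            *vocab.word_count.entry(word.to_string()).or_insert(0) += 1;
            if !vocab.word_to_index.contains_key(word) {
                // The vector length plays the role of next_index in the pseudocode.
                let next_index = vocab.index_to_word.len();
                vocab.word_to_index.insert(word.to_string(), next_index);
                vocab.index_to_word.push(word.to_string());
            }
        }
    }
    vocab
}
```

Because indices are handed out in order of first appearance, the mapping is deterministic for a fixed corpus order, which is the property the description asks for.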
Vocabulary Pruning:
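Given a frequency threshold $\tau$, the pruned vocabulary $V_\tau$ retains only frequent words of the full vocabulary $V$, together with the special tokens:
$$V_\tau = \{w \in V : \text{count}(w) \geq \tau\} \cup \{\text{SOS}, \text{EOS}, \text{UNK}\}$$
Words excluded by pruning are mapped to UNK during encoding:
$$\text{encode}(w) = \begin{cases} \text{index}(w) & \text{if } w \in V_\tau \\ \text{index}(\text{UNK}) & \text{otherwise} \end{cases}$$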
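A minimal sketch of pruning and UNK fallback, assuming word counts have already been collected. The function names `prune` and `encode`, the token spellings, and the choice of index 2 for UNK are assumptions for illustration. Surviving words are re-indexed in sorted order here so that the result does not depend on hash map iteration order; a real pipeline might instead preserve first-appearance order.

```rust
use std::collections::HashMap;

// Assumed reserved index for UNK, placed after SOS = 0 and EOS = 1.
pub const UNK: usize = 2;

/// Rebuilds word_to_index, keeping only words seen at least `threshold` times.
pub fn prune(word_count: &HashMap<String, usize>, threshold: usize) -> HashMap<String, usize> {
    let mut kept: Vec<&String> = word_count
        .iter()
        .filter(|(_, count)| **count >= threshold)
        .map(|(word, _)| word)
        .collect();
    // HashMap iteration order is unspecified, so sort for a deterministic mapping.
    kept.sort();

    let mut word_to_index = HashMap::from([
        ("SOS".to_string(), 0),
        ("EOS".to_string(), 1),
        ("UNK".to_string(), UNK),
    ]);
    for word in kept {
        // The current map size is the next free contiguous index.
        let next_index = word_to_index.len();
        word_to_index.entry(word.clone()).or_insert(next_index);
    }
    word_to_index
}

/// Looks up a word, falling back to UNK for pruned or never-seen words.
pub fn encode(word_to_index: &HashMap<String, usize>, word: &str) -> usize {
    word_to_index.get(word).copied().unwrap_or(UNK)
}
```

Note that a word pruned for low frequency and a word never seen in training are indistinguishable after encoding: both become UNK.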
Zipf's Law and Vocabulary Size:
Word frequencies in natural language approximately follow Zipf's law:
$$f(r) \propto \frac{1}{r^{\alpha}}$$
where $f(r)$ is the frequency of the word with rank $r$ and $\alpha \approx 1$. This means a large fraction of unique words appear very infrequently. Pruning eliminates the long tail of rare words, often removing a large portion of the vocabulary while affecting only a small fraction of the total tokens in the corpus.
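To get a feel for the trade-off, the token share covered by the most frequent words can be computed under an idealized Zipf distribution. The sketch below assumes frequencies are exactly proportional to $1/r^{\alpha}$, which real corpora only approximate; `token_coverage` is a hypothetical helper name, and it expects `total` to be at least 1.

```rust
/// Share of all tokens covered by the `kept` most frequent of `total` word
/// types, assuming frequencies exactly proportional to 1 / rank^alpha.
pub fn token_coverage(kept: usize, total: usize, alpha: f64) -> f64 {
    // Unnormalized Zipf weight of the word at a given rank (ranks start at 1).
    let weight = |rank: usize| (rank as f64).powf(-alpha);
    let kept_mass: f64 = (1..=kept.min(total)).map(|r| weight(r)).sum();
    let total_mass: f64 = (1..=total).map(|r| weight(r)).sum();
    kept_mass / total_mass
}
```

With `alpha = 1.0`, keeping the top 10,000 of 100,000 word types covers roughly 81% of tokens in this idealized model, so 90% of the vocabulary entries account for under a fifth of the running text. Actual coverage depends on the corpus and on the threshold chosen.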
Embedding Relationship:
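The vocabulary size $|V|$ directly determines the embedding table dimensions:
$$E \in \mathbb{R}^{|V| \times d}$$
where $d$ is the embedding dimension. Reducing $|V|$ through pruning proportionally reduces the number of parameters in the embedding layer, with memory savings of $(|V| - |V_\tau|) \times d$ parameters, where $V_\tau$ is the pruned vocabulary.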
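The arithmetic is simple enough to check by hand, but a small helper makes the scale concrete. The function names and the example sizes below are illustrative assumptions, not figures from any particular model.

```rust
/// Number of parameters in an embedding table with `vocab_size` rows and `dim` columns.
pub fn embedding_params(vocab_size: usize, dim: usize) -> usize {
    vocab_size * dim
}

/// Parameters saved by shrinking the vocabulary from `full` to `pruned` entries.
/// Returns 0 if the "pruned" vocabulary is not actually smaller.
pub fn params_saved(full: usize, pruned: usize, dim: usize) -> usize {
    full.saturating_sub(pruned) * dim
}
```

For example, pruning a 50,000-word vocabulary to 20,000 words with an embedding dimension of 256 removes 30,000 rows, or 7,680,000 parameters, from a table that originally held 12,800,000.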