Principle:LaurentMazare Tch rs Vocabulary Management

Knowledge Sources	LaurentMazare_Tch_rs
Domains	Natural Language Processing, Data Engineering
Last Updated	2026-02-08 00:00 GMT

Overview

Vocabulary management maintains a bidirectional mapping between words and integer indices, handles special tokens for sequence boundaries, and supports frequency-based vocabulary pruning.

Description

A vocabulary is a fundamental data structure in natural language processing that bridges the gap between raw text and the numerical representations required by neural networks. Effective vocabulary management involves:

Word-to-index mapping: Each unique word (or token) in the training corpus is assigned a unique integer index. This mapping must be deterministic and consistent across training and inference. The mapping is typically stored as a hash table, which gives expected $O(1)$ lookup during tokenization.

Index-to-word mapping: The inverse mapping allows converting model outputs (integer indices or probability distributions over indices) back into human-readable text. This is essential for decoding model predictions during inference and evaluation.
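The two mappings above can be sketched as a pair of containers that are always updated together. This is a minimal illustration, not the library's implementation: the `BiVocab` type and its method names are hypothetical, and a `Vec` is assumed for the index-to-word direction because indices are contiguous.

```rust
use std::collections::HashMap;

/// Hypothetical two-way vocabulary: a hash map for word -> index
/// and a vector for index -> word.
#[derive(Debug, Default)]
pub struct BiVocab {
    word_to_index: HashMap<String, usize>,
    index_to_word: Vec<String>,
}

impl BiVocab {
    /// Returns the index of `word`, assigning the next free index if it is new.
    pub fn add_word(&mut self, word: &str) -> usize {
        if let Some(&idx) = self.word_to_index.get(word) {
            return idx;
        }
        let idx = self.index_to_word.len();
        self.word_to_index.insert(word.to_string(), idx);
        self.index_to_word.push(word.to_string());
        idx
    }

    /// Word -> index lookup (expected O(1) through the hash map).
    pub fn index_of(&self, word: &str) -> Option<usize> {
        self.word_to_index.get(word).copied()
    }

    /// Index -> word lookup, used when decoding model outputs.
    pub fn word_at(&self, index: usize) -> Option<&str> {
        self.index_to_word.get(index).map(String::as_str)
    }

    /// Converts indices back to space-separated text, skipping indices
    /// that are not in the vocabulary.
    pub fn decode(&self, indices: &[usize]) -> String {
        indices
            .iter()
            .filter_map(|&i| self.word_at(i))
            .collect::<Vec<_>>()
            .join(" ")
    }
}
```

Because `add_word` writes to both containers in one step, every index returned by `index_of` can be turned back into the same word by `word_at`.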

Special tokens: The vocabulary includes reserved tokens with specific semantic roles:
- SOS (Start of Sequence): Typically index 0. Fed as the first input to the decoder, signaling the start of generation.
- EOS (End of Sequence): Typically index 1. The decoder is trained to output this token when it has finished generating the sequence. During inference, generation stops when EOS is produced.
- UNK (Unknown): Maps out-of-vocabulary words encountered during inference to a known index.
- PAD (Padding): Used to fill sequences to uniform length within a batch.
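The roles of these tokens can be illustrated with a short sketch. SOS = 0 and EOS = 1 follow the list above. The indices chosen for UNK and PAD, and the helper names `finish_and_pad` and `truncate_at_eos`, are assumptions made for this example only.

```rust
// Reserved indices. SOS = 0 and EOS = 1 follow the text above;
// UNK = 2 and PAD = 3 are assumptions made for this sketch.
pub const SOS: usize = 0; // first decoder input
pub const EOS: usize = 1; // marks the end of a sequence
pub const UNK: usize = 2; // stands in for out-of-vocabulary words
pub const PAD: usize = 3; // filler for batching

/// Appends EOS to every sequence, then right-pads with PAD so that
/// all sequences in the batch share the length of the longest one.
pub fn finish_and_pad(batch: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let max_len = batch.iter().map(|s| s.len() + 1).max().unwrap_or(0);
    batch
        .iter()
        .map(|seq| {
            let mut out = Vec::with_capacity(max_len);
            out.extend_from_slice(seq);
            out.push(EOS);
            out.resize(max_len, PAD);
            out
        })
        .collect()
}

/// Returns the generated tokens up to, but not including, the first EOS.
/// If no EOS was produced, the whole slice is returned.
pub fn truncate_at_eos(generated: &[usize]) -> &[usize] {
    match generated.iter().position(|&t| t == EOS) {
        Some(end) => &generated[..end],
        None => generated,
    }
}
```

A model consuming such a batch would normally also need to ignore PAD positions, for example through a mask; that part is outside this sketch.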

Word frequency tracking: During vocabulary construction, the frequency of each word is recorded. This enables vocabulary pruning: words appearing fewer than a threshold number of times can be excluded from the vocabulary and mapped to UNK. This reduces the embedding table size and can improve generalization by discouraging the model from overfitting to rare words.

Usage

Vocabulary management is required in any NLP system that processes discrete text tokens, including machine translation, text classification, language modeling, and question answering. It is typically one of the first steps in any NLP data pipeline and must be performed consistently across all stages of model development.

Theoretical Basis

Formal Definition:

A vocabulary $V$ is a bijective mapping:

$V : \{w_1, w_2, \dots, w_{|V|}\} \leftrightarrow \{0, 1, \dots, |V| - 1\}$

with special tokens occupying reserved indices:

$V(\text{SOS}) = 0, \quad V(\text{EOS}) = 1$

Vocabulary Construction:

Given a corpus $C$ of tokenized sentences:

INITIALIZE:
    word_count := empty map
    word_to_index := {SOS: 0, EOS: 1}
    index_to_word := {0: SOS, 1: EOS}
    next_index := 2

BUILD(corpus C):
    for each sentence s in C:
        for each word w in s:
            if w not in word_count:
                word_count[w] := 0
            word_count[w] := word_count[w] + 1
            if w not in word_to_index:
                word_to_index[w] := next_index
                index_to_word[next_index] := w
                next_index := next_index + 1
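The construction procedure above could be written in Rust roughly as follows. This is a sketch under stated assumptions rather than the referenced implementation: the `VocabBuilder` type and its method names are hypothetical, the special tokens are stored under the literal strings "SOS" and "EOS", and the corpus is assumed to be already tokenized.

```rust
use std::collections::HashMap;

/// Hypothetical container mirroring the three tables in the pseudocode.
#[derive(Debug, Default)]
pub struct VocabBuilder {
    pub word_count: HashMap<String, usize>,
    pub word_to_index: HashMap<String, usize>,
    pub index_to_word: HashMap<usize, String>,
}

impl VocabBuilder {
    /// INITIALIZE: reserve index 0 for SOS and index 1 for EOS.
    pub fn new() -> Self {
        let mut builder = VocabBuilder::default();
        builder.assign_next_index("SOS");
        builder.assign_next_index("EOS");
        builder
    }

    /// Gives `word` the next unused index in both directions.
    /// Indices are contiguous, so the table size is the next free index.
    fn assign_next_index(&mut self, word: &str) {
        let next_index = self.word_to_index.len();
        self.word_to_index.insert(word.to_string(), next_index);
        self.index_to_word.insert(next_index, word.to_string());
    }

    /// BUILD: count every occurrence and index each word on first sight.
    pub fn build(&mut self, corpus: &[Vec<&str>]) {
        for sentence in corpus {
            for &word in sentence {
                // A word absent from the count table starts at zero.
                *self.word_count.entry(word.to_string()).or_insert(0) += 1;
                if !self.word_to_index.contains_key(word) {
                    self.assign_next_index(word);
                }
            }
        }
    }
}
```

Indices depend on the order in which words are first seen, so the same corpus in the same order always yields the same mapping. In this sketch, a corpus word spelled exactly "SOS" or "EOS" would share the reserved entry; a real pipeline would need to rule that out.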

Vocabulary Pruning:

Given a frequency threshold $k$, the pruned vocabulary retains only frequent words:

$V_{pruned} = \{w \in V : count(w) \geq k\} \cup \{\text{SOS}, \text{EOS}, \text{UNK}\}$

The retained words are re-indexed so that the indices remain contiguous, and words excluded by pruning are mapped to UNK during encoding:

$encode(w) = \begin{cases} V_{pruned}(w) & \text{if } w \in V_{pruned} \\ V_{pruned}(\text{UNK}) & \text{otherwise} \end{cases}$
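Pruning and the UNK fallback can be sketched as two small functions. The names `prune` and `encode`, the choice of index 2 for UNK, and the alphabetical ordering of surviving words are all assumptions for illustration. The ordering is only there to make the result independent of hash-map iteration order.

```rust
use std::collections::HashMap;

/// Special tokens, in index order. UNK at index 2 is an assumption.
pub const SPECIALS: [&str; 3] = ["SOS", "EOS", "UNK"];

/// Builds the pruned word -> index table from raw counts.
/// Special tokens always survive; other words survive when count >= k.
/// Surviving words are re-indexed contiguously after the special tokens.
pub fn prune(word_count: &HashMap<String, usize>, k: usize) -> HashMap<String, usize> {
    let mut kept: Vec<&str> = word_count
        .iter()
        .filter(|(w, c)| **c >= k && !SPECIALS.contains(&w.as_str()))
        .map(|(w, _)| w.as_str())
        .collect();
    kept.sort_unstable();

    let mut ordered: Vec<&str> = SPECIALS.to_vec();
    ordered.extend(kept);
    ordered
        .into_iter()
        .enumerate()
        .map(|(index, word)| (word.to_string(), index))
        .collect()
}

/// encode(w): the word's index if it survived pruning, otherwise UNK's index.
/// Expects a table produced by `prune`, which always contains "UNK".
pub fn encode(pruned: &HashMap<String, usize>, word: &str) -> usize {
    pruned.get(word).copied().unwrap_or(pruned["UNK"])
}
```

Words that were pruned and words never seen during construction take the same path through `encode`, so the model sees a single UNK index in both cases.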

Zipf's Law and Vocabulary Size:

Word frequencies in natural language approximately follow Zipf's law:

$f(r) \propto \frac{1}{r^{\alpha}}$

where $r$ is the rank and $\alpha \approx 1$. This means a large fraction of unique words appear very infrequently. Pruning eliminates the long tail of rare words, often removing a large portion of the vocabulary while affecting only a small fraction of the total tokens in the corpus.
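To get a feel for the size of this effect, the following sketch evaluates an idealized Zipf distribution numerically. The function name `zipf_pruning_impact` and the example sizes are hypothetical. Real corpora only approximate Zipf's law, so the numbers are indicative at best.

```rust
/// Under an idealized Zipf distribution f(r) = 1 / r^alpha over `n_types`
/// ranked words, returns
///   (fraction of word types removed, fraction of corpus tokens affected)
/// when only the `keep` most frequent words are retained.
pub fn zipf_pruning_impact(n_types: usize, keep: usize, alpha: f64) -> (f64, f64) {
    assert!(n_types > 0, "the vocabulary must contain at least one word");
    let keep = keep.min(n_types);

    // Unnormalized probability mass of ranks from..=to (empty range gives 0).
    let mass = |from: usize, to: usize| -> f64 {
        (from..=to).map(|r| (r as f64).powf(-alpha)).sum()
    };

    let total = mass(1, n_types);
    let tail = mass(keep + 1, n_types);
    let types_removed = (n_types - keep) as f64 / n_types as f64;
    (types_removed, tail / total)
}
```

For example, `zipf_pruning_impact(100_000, 20_000, 1.0)` removes 80% of the word types while the removed words account for roughly 13% of the tokens under this idealized model.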

Embedding Relationship:

The vocabulary size $|V|$ directly determines the embedding table dimensions:

$E \in \mathbb{R}^{|V| \times d}$

where $d$ is the embedding dimension. Reducing $|V|$ through pruning proportionally reduces the number of parameters in the embedding layer: if pruning removes $\Delta |V|$ words, the embedding table shrinks by $\Delta |V| \times d$ parameters.
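
As an illustrative calculation with hypothetical sizes (not taken from the source), suppose pruning shrinks the vocabulary from $50{,}000$ to $30{,}000$ words and the embedding dimension is $d = 256$. The number of embedding parameters saved is then:

$\Delta |V| \times d = 20{,}000 \times 256 = 5{,}120{,}000$

At 4 bytes per 32-bit float, that is about 20.5 MB of embedding weights.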

Related Pages

Implementation:LaurentMazare_Tch_rs_Translation_Lang
