Principle:LaurentMazare Tch rs Vocabulary Management

From Leeroopedia


Knowledge Sources
Domains: Natural Language Processing, Data Engineering
Last Updated: 2026-02-08 00:00 GMT

Overview

Vocabulary management maintains a bidirectional mapping between words and integer indices, handles special tokens for sequence boundaries, and supports frequency-based vocabulary pruning.

Description

A vocabulary is a fundamental data structure in natural language processing that bridges the gap between raw text and the numerical representations required by neural networks. Effective vocabulary management involves:

  • Word-to-index mapping: Each unique word (or token) in the training corpus is assigned a unique integer index. This mapping must be deterministic and consistent across training and inference. It is typically stored as a hash table, which gives average-case O(1) lookup during tokenization.
  • Index-to-word mapping: The inverse mapping allows converting model outputs (integer indices or probability distributions over indices) back into human-readable text. This is essential for decoding model predictions during inference and evaluation.
  • Special tokens: The vocabulary includes reserved tokens with specific semantic roles:
    • SOS (Start of Sequence): Typically index 0. Fed as the first input to the decoder, signaling the start of generation.
    • EOS (End of Sequence): Typically index 1. The decoder is trained to output this token when it has finished generating the sequence. During inference, generation stops when EOS is produced.
    • UNK (Unknown): Maps out-of-vocabulary words encountered during inference to a known index.
    • PAD (Padding): Used to fill sequences to uniform length within a batch.
  • Word frequency tracking: During vocabulary construction, the frequency of each word is recorded. This enables vocabulary pruning: words appearing fewer than a threshold number of times can be excluded from the vocabulary and mapped to UNK. This reduces the embedding table size and can improve generalization by keeping the model from overfitting to rare words.
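
The components listed above can be sketched as a small Rust type. This is a minimal illustration rather than code from tch-rs: the `Vocab` type, its method names, the token strings, and the choice of indices 2 and 3 for UNK and PAD are all assumptions made for this example. Only SOS = 0 and EOS = 1 follow the text.

```rust
use std::collections::HashMap;

// Reserved indices. SOS and EOS follow the text; UNK and PAD are assumed.
const SOS: usize = 0;
const EOS: usize = 1;
const UNK: usize = 2;
const PAD: usize = 3;

/// Hypothetical bidirectional vocabulary (not a tch-rs type).
pub struct Vocab {
    word_to_index: HashMap<String, usize>,
    index_to_word: Vec<String>,
}

impl Vocab {
    /// Creates a vocabulary that contains only the four special tokens.
    pub fn new() -> Self {
        let mut vocab = Vocab {
            word_to_index: HashMap::new(),
            index_to_word: Vec::new(),
        };
        // Insertion order fixes the indices: SOS=0, EOS=1, UNK=2, PAD=3.
        for token in ["<sos>", "<eos>", "<unk>", "<pad>"] {
            vocab.add_word(token);
        }
        vocab
    }

    /// Returns the index of `word`, assigning the next free index if it is new.
    pub fn add_word(&mut self, word: &str) -> usize {
        if let Some(&idx) = self.word_to_index.get(word) {
            return idx;
        }
        let idx = self.index_to_word.len();
        self.word_to_index.insert(word.to_string(), idx);
        self.index_to_word.push(word.to_string());
        idx
    }

    /// Encodes a whitespace-separated sentence, wrapped in SOS and EOS.
    /// Words missing from the vocabulary map to UNK.
    pub fn encode(&self, sentence: &str) -> Vec<usize> {
        let mut ids = vec![SOS];
        for word in sentence.split_whitespace() {
            ids.push(self.word_to_index.get(word).copied().unwrap_or(UNK));
        }
        ids.push(EOS);
        ids
    }

    /// Decodes indices to text: skips SOS and PAD, stops at the first EOS.
    pub fn decode(&self, ids: &[usize]) -> String {
        let mut words: Vec<&str> = Vec::new();
        for &id in ids {
            if id == EOS {
                break;
            }
            if id == SOS || id == PAD {
                continue;
            }
            // Out-of-range indices are rendered as the UNK token.
            let word = self.index_to_word.get(id).map(String::as_str).unwrap_or("<unk>");
            words.push(word);
        }
        words.join(" ")
    }
}
```

Storing the inverse mapping in a `Vec` works here because indices are assigned densely from 0, so the index doubles as the vector position.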

Usage

Vocabulary management is required in any NLP system that processes discrete text tokens, including machine translation, text classification, language modeling, and question answering. It is typically one of the first steps in any NLP data pipeline and must be performed consistently across all stages of model development.

Theoretical Basis

Formal Definition:

A vocabulary V is a bijective mapping:

$$V : \{w_1, w_2, \ldots, w_{|V|}\} \to \{0, 1, \ldots, |V| - 1\}$$

with special tokens occupying reserved indices:

$$V(\text{SOS}) = 0, \quad V(\text{EOS}) = 1$$

Vocabulary Construction:

Given a corpus C of tokenized sentences:

INITIALIZE:
    word_count := empty map
    word_to_index := {SOS: 0, EOS: 1}
    index_to_word := {0: SOS, 1: EOS}
    next_index := 2
BUILD(corpus C):
    for each sentence s in C:
        for each word w in s:
            word_count[w] := (word_count[w] if w in word_count else 0) + 1
            if w not in word_to_index:
                word_to_index[w] := next_index
                index_to_word[next_index] := w
                next_index := next_index + 1
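
The construction procedure above translates fairly directly into Rust. The following is a sketch under stated assumptions, not the tch-rs implementation: the `Lang` struct and `build` function are hypothetical names, and the corpus is assumed to be already split into words.

```rust
use std::collections::HashMap;

/// Hypothetical container mirroring the three maps in the pseudocode.
pub struct Lang {
    pub word_count: HashMap<String, usize>,
    pub word_to_index: HashMap<String, usize>,
    pub index_to_word: HashMap<usize, String>,
}

/// Builds the vocabulary from sentences that are already tokenized.
pub fn build(corpus: &[Vec<&str>]) -> Lang {
    // INITIALIZE: the special tokens occupy indices 0 and 1.
    let mut lang = Lang {
        word_count: HashMap::new(),
        word_to_index: HashMap::from([("SOS".to_string(), 0), ("EOS".to_string(), 1)]),
        index_to_word: HashMap::from([(0, "SOS".to_string()), (1, "EOS".to_string())]),
    };
    let mut next_index = 2;

    // BUILD: count every occurrence, index each word on first sight.
    for sentence in corpus {
        for &word in sentence {
            // A missing entry counts as zero before the increment.
            *lang.word_count.entry(word.to_string()).or_insert(0) += 1;
            if !lang.word_to_index.contains_key(word) {
                lang.word_to_index.insert(word.to_string(), next_index);
                lang.index_to_word.insert(next_index, word.to_string());
                next_index += 1;
            }
        }
    }
    lang
}
```

Because indices are assigned in order of first appearance, the same corpus in the same order always produces the same mapping, which is what makes the vocabulary reusable between training and inference.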

Vocabulary Pruning:

Given a frequency threshold k, the pruned vocabulary retains only frequent words:

$$V_{\text{pruned}} = \{w \in V : \text{count}(w) \geq k\} \cup \{\text{SOS}, \text{EOS}, \text{UNK}\}$$

Words excluded by pruning are mapped to UNK during encoding:

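$$\text{encode}(w) = \begin{cases} V_{\text{pruned}}(w) & \text{if } w \in V_{\text{pruned}} \\ V_{\text{pruned}}(\text{UNK}) & \text{otherwise} \end{cases}$$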
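
Pruning and the UNK fallback might be sketched as follows. The function names `prune` and `encode`, the token spellings, and the decision to re-index the surviving words in sorted order are assumptions for this example. Sorting is used only because `HashMap` iteration order is unspecified and the resulting indices should be reproducible.

```rust
use std::collections::HashMap;

// Special tokens always survive pruning; names and order are assumed here.
const SPECIALS: [&str; 3] = ["SOS", "EOS", "UNK"];

/// Builds a pruned word-to-index map that keeps words with count >= k.
pub fn prune(word_count: &HashMap<String, usize>, k: usize) -> HashMap<String, usize> {
    let mut kept: Vec<String> = Vec::new();
    for (word, &count) in word_count {
        if count >= k && !SPECIALS.contains(&word.as_str()) {
            kept.push(word.clone());
        }
    }
    // Sort so that index assignment does not depend on hash order.
    kept.sort();

    // Special tokens take the first indices, then the surviving words.
    let mut ordered: Vec<String> = SPECIALS.iter().map(|s| s.to_string()).collect();
    ordered.extend(kept);
    ordered.into_iter().enumerate().map(|(i, w)| (w, i)).collect()
}

/// Looks up a word, falling back to the UNK index when it was pruned or never seen.
pub fn encode(word_to_index: &HashMap<String, usize>, word: &str) -> usize {
    match word_to_index.get(word) {
        Some(&idx) => idx,
        None => word_to_index["UNK"],
    }
}
```

Note that `encode` is many-to-one after pruning: every excluded word shares the UNK index, so decoding cannot recover the original word.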

Zipf's Law and Vocabulary Size:

Word frequencies in natural language approximately follow Zipf's law:

$$f(r) \propto \frac{1}{r^{\alpha}}$$

where r is the frequency rank of a word and α ≈ 1. This means that a large fraction of unique words appear very infrequently. Pruning eliminates the long tail of rare words, often removing a large portion of the vocabulary while affecting only a small fraction of the total tokens in the corpus.
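
The effect of pruning on a long-tailed distribution can be checked numerically. The sketch below uses two hypothetical helpers: `zipf_counts` generates idealized counts (real corpora follow Zipf's law only approximately), and `pruning_impact` reports the share of distinct words removed next to the share of corpus tokens that would map to UNK.

```rust
/// Idealized Zipf counts: the word at rank r occurs round(top / r^alpha)
/// times, with a floor of one occurrence.
pub fn zipf_counts(num_words: usize, top: f64, alpha: f64) -> Vec<u64> {
    (1..=num_words)
        .map(|r| (top / (r as f64).powf(alpha)).round().max(1.0) as u64)
        .collect()
}

/// Returns (fraction of distinct words removed, fraction of tokens mapped
/// to UNK) when words with count < k are pruned. Assumes `counts` is non-empty.
pub fn pruning_impact(counts: &[u64], k: u64) -> (f64, f64) {
    let total_types = counts.len() as f64;
    let total_tokens: u64 = counts.iter().sum();
    let removed_types = counts.iter().filter(|&&c| c < k).count() as f64;
    let removed_tokens: u64 = counts.iter().filter(|&&c| c < k).sum();
    (
        removed_types / total_types,
        removed_tokens as f64 / total_tokens as f64,
    )
}
```

Calling `pruning_impact(&zipf_counts(1000, 1000.0, 1.0), 3)` on this idealized distribution shows the asymmetry described above: the first value (share of vocabulary removed) is much larger than the second (share of tokens affected).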

Embedding Relationship:

The vocabulary size |V| directly determines the embedding table dimensions:

$$E \in \mathbb{R}^{|V| \times d}$$

where d is the embedding dimension. Reducing |V| through pruning proportionally reduces the number of parameters in the embedding layer: removing Δ|V| words saves Δ|V| × d parameters.
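
As a quick arithmetic check of this relationship, the savings can be computed directly. The function names and the vocabulary sizes below are hypothetical, chosen only to illustrate the formula.

```rust
/// Number of parameters in an embedding table of shape |V| x d.
pub fn embedding_params(vocab_size: usize, dim: usize) -> usize {
    vocab_size * dim
}

/// Parameters saved when pruning shrinks the vocabulary from `before` to
/// `after` words; returns 0 if the vocabulary did not shrink.
pub fn pruning_savings(before: usize, after: usize, dim: usize) -> usize {
    embedding_params(before, dim).saturating_sub(embedding_params(after, dim))
}
```

For example, shrinking a 50,000-word vocabulary to 20,000 words at d = 256 gives `pruning_savings(50_000, 20_000, 256)`, which is 30,000 × 256 = 7,680,000 parameters, or about 30.7 MB if each parameter is stored as a 32-bit float.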
