Principle:Ggml org Llama cpp Vocabulary System
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Vocabulary |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The Vocabulary System principle covers how the token vocabulary, special tokens, and tokenizer configuration used by language models are stored and accessed.
Description
This principle covers the vocabulary data structure that stores the mapping between token IDs and their text representations, along with tokenizer-specific metadata such as merge rules (for BPE), scores (for SentencePiece), special token definitions (BOS, EOS, padding, unknown), and pre-tokenization rules. The vocabulary is loaded from GGUF model metadata and used by the tokenization and detokenization pipelines.
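The structure described above can be pictured with a minimal sketch. It is written in Python for brevity and is not the llama.cpp implementation: the class name `Vocab`, its fields, and its methods are all hypothetical and chosen only to mirror the components listed in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Vocab:
    """Toy vocabulary. Every name here is hypothetical, not a llama.cpp API."""
    id_to_text: List[str]                                        # token ID -> text
    scores: List[float] = field(default_factory=list)            # SentencePiece-style scores
    merges: List[Tuple[str, str]] = field(default_factory=list)  # BPE merge rules, in priority order
    bos_id: Optional[int] = None
    eos_id: Optional[int] = None
    unk_id: Optional[int] = None
    pad_id: Optional[int] = None
    text_to_id: Dict[str, int] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        # Build the reverse mapping once so lookups work in both directions.
        self.text_to_id = {text: i for i, text in enumerate(self.id_to_text)}

    def token_to_id(self, text: str) -> int:
        # Fall back to the unknown token when the text is not in the vocabulary.
        if text in self.text_to_id:
            return self.text_to_id[text]
        if self.unk_id is None:
            raise KeyError(text)
        return self.unk_id

    def detokenize(self, ids: List[int]) -> str:
        # Skip BOS/EOS/padding so only the text content is reproduced.
        special = {self.bos_id, self.eos_id, self.pad_id}
        return "".join(self.id_to_text[i] for i in ids if i not in special)


vocab = Vocab(
    id_to_text=["<unk>", "<s>", "</s>", "he", "llo", " world"],
    unk_id=0, bos_id=1, eos_id=2,
)
pieces = ["he", "llo", " world"]
ids = [vocab.bos_id] + [vocab.token_to_id(t) for t in pieces] + [vocab.eos_id]
print(ids)                    # [1, 3, 4, 5, 2]
print(vocab.detokenize(ids))  # hello world
```

Keeping the special token IDs as explicit fields, rather than looking them up by their text, reflects the point that they are part of the model's metadata and not ordinary vocabulary entries.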
Usage
Apply this principle when implementing tokenizer logic, handling special tokens in prompt construction, or when extending vocabulary support for new tokenizer types.
Theoretical Basis
Language model vocabularies define the discrete token set that the model operates over. Different tokenizer algorithms (BPE, SentencePiece, WordPiece) rely on different data: BPE requires an ordered list of merge rules that define how adjacent symbol pairs are combined, SentencePiece-style tokenizers use per-token scores to rank candidate tokenizations, and WordPiece uses greedy longest-match lookup against the vocabulary, with a continuation prefix marking non-initial pieces of a word. The vocabulary system must also manage special tokens that carry semantic meaning for the model (beginning and end of sequence, padding, unknown token) and ensure that they are handled correctly during both tokenization and generation. The vocabulary header defines the data structures and interface through which the rest of the codebase accesses this information.
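To make the contrast between tokenizer data concrete, here is a simplified sketch of two of the schemes mentioned above: BPE driven by ranked merge rules, and WordPiece driven by longest-match lookup. It is an illustration under stated assumptions, not the llama.cpp code path: the function names, the toy merge list, the `##` continuation prefix, and the `[UNK]` fallback are all assumptions made for the example, and pre-tokenization and byte-level handling are omitted.

```python
from typing import Dict, List, Set, Tuple


def bpe_tokenize(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Repeatedly apply the highest-priority (lowest-rank) merge found in the word."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merge rule remains
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first; non-initial pieces carry the '##' prefix."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces


# Toy merge list: earlier entries have higher priority.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]
ranks = {pair: rank for rank, pair in enumerate(merges)}

bpe_out = bpe_tokenize("lower", ranks)
wp_out = wordpiece_tokenize("unaffable", {"un", "##aff", "##able"})
print(bpe_out)  # ['low', 'er']
print(wp_out)   # ['un', '##aff', '##able']
```

The BPE function needs only the ordered merge list, while the WordPiece function needs only the set of known pieces, which is why a shared vocabulary structure has to carry tokenizer-specific metadata alongside the ID-to-text mapping.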