Principle:Ggml org Llama cpp Vocabulary System

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Tokenization, Vocabulary
Last Updated	2026-02-15 00:00 GMT

Overview

The Vocabulary System is the principle of managing the token vocabulary, special tokens, and tokenizer configuration used by language models.

Description

This principle covers the vocabulary data structure that stores the mapping between token IDs and their text representations, along with tokenizer-specific metadata such as merge rules (for BPE), scores (for SentencePiece), special token definitions (BOS, EOS, padding, unknown), and pre-tokenization rules. The vocabulary is loaded from GGUF model metadata and used by the tokenization and detokenization pipelines.
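
As a rough illustration of the structure described above, the following C++ sketch keeps a bidirectional mapping between token IDs and text, a per-token score, and a handful of special token IDs. All names here (`SimpleVocab`, `TokenData`, `TOKEN_NULL`, `add`, `find`, `detokenize`) are hypothetical and chosen for this example; they are not the definitions used in llama.cpp, and loading from GGUF metadata is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration only (not llama.cpp's own).
using token_id = std::int32_t;
constexpr token_id TOKEN_NULL = -1;  // sentinel meaning "no such token"

struct TokenData {
    std::string text;     // text representation of the token
    float       score;    // used by score-based tokenizers, 0 otherwise
    bool        special;  // true for control tokens such as BOS/EOS
};

struct SimpleVocab {
    std::vector<TokenData>                    id_to_token;  // ID -> token data
    std::unordered_map<std::string, token_id> token_to_id;  // text -> ID

    // Special token IDs; TOKEN_NULL means the model does not define one.
    token_id bos = TOKEN_NULL;
    token_id eos = TOKEN_NULL;
    token_id unk = TOKEN_NULL;
    token_id pad = TOKEN_NULL;

    // Append a token and return its newly assigned ID.
    token_id add(std::string text, float score = 0.0f, bool special = false) {
        assert(token_to_id.find(text) == token_to_id.end());  // no duplicates
        const token_id id = static_cast<token_id>(id_to_token.size());
        token_to_id.emplace(text, id);
        id_to_token.push_back({std::move(text), score, special});
        return id;
    }

    // Look up a token by text, falling back to the unknown token.
    token_id find(const std::string & text) const {
        const auto it = token_to_id.find(text);
        return it == token_to_id.end() ? unk : it->second;
    }

    // Concatenate token texts, skipping special tokens and out-of-range IDs.
    std::string detokenize(const std::vector<token_id> & ids) const {
        std::string out;
        for (const token_id id : ids) {
            if (id < 0 || static_cast<std::size_t>(id) >= id_to_token.size()) {
                continue;
            }
            const TokenData & t = id_to_token[static_cast<std::size_t>(id)];
            if (!t.special) {
                out += t.text;
            }
        }
        return out;
    }
};
```

Keeping the special token IDs as plain fields next to the lookup tables means prompt-construction code can test for a missing token (for example, a model without a padding token) by comparing against the sentinel.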

Usage

Apply this principle when implementing tokenizer logic, handling special tokens in prompt construction, or when extending vocabulary support for new tokenizer types.

Theoretical Basis

Language model vocabularies define the discrete token set that the model operates over. Different tokenizer algorithms (BPE, SentencePiece, WordPiece) rely on different data: BPE requires an ordered list of merge rules that define which adjacent symbol pairs are combined and in what priority, SentencePiece models attach a score to each token that is used to rank candidate tokenizations, and WordPiece uses greedy longest-prefix matching against the vocabulary. The vocabulary system must also manage special tokens that have semantic meaning to the model (beginning/end of sequence, padding, unknown token) and ensure that these are handled correctly during both tokenization and generation. The vocabulary header (see Implementation:Ggml_org_Llama_cpp_Vocab_Header under Related Pages) defines the data structures and interface used throughout the codebase to access vocabulary information.
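
To make the role of BPE merge rules concrete, here is a minimal C++ sketch of the merge loop: it repeatedly finds the adjacent pair with the best (lowest) rank and joins it until no known pair remains. The `MergeRanks` alias and the `bpe_merge` function are hypothetical names for this example, and the quadratic scan is chosen for readability; it is not how llama.cpp implements BPE.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper type: merge rules keyed by (left, right) symbol pair.
// A lower rank means the rule appeared earlier in the merge list and wins.
using MergeRanks = std::map<std::pair<std::string, std::string>, int>;

// Apply merge rules to a sequence of symbols until no rule matches.
std::vector<std::string> bpe_merge(std::vector<std::string> symbols,
                                   const MergeRanks & ranks) {
    while (symbols.size() > 1) {
        int         best_rank = INT_MAX;
        std::size_t best_pos  = 0;

        // Scan all adjacent pairs and remember the one with the lowest rank.
        for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
            const auto it = ranks.find(std::make_pair(symbols[i], symbols[i + 1]));
            if (it != ranks.end() && it->second < best_rank) {
                best_rank = it->second;
                best_pos  = i;
            }
        }

        if (best_rank == INT_MAX) {
            break;  // no adjacent pair has a merge rule
        }

        // Join the winning pair into a single symbol.
        assert(best_pos + 1 < symbols.size());
        symbols[best_pos] += symbols[best_pos + 1];
        symbols.erase(symbols.begin() + static_cast<std::ptrdiff_t>(best_pos + 1));
    }
    return symbols;
}
```

With the rules ("l", "l") at rank 0, ("h", "e") at rank 1, and ("he", "ll") at rank 2, the input symbols h, e, l, l, o merge step by step into "hell" and "o". The resulting strings would then be mapped to token IDs through the vocabulary's text-to-ID table.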

Related Pages

Implementation:Ggml_org_Llama_cpp_Vocab_Header
