Principle:Ggml org Llama cpp Vocabulary System

From Leeroopedia
Knowledge Sources
Domains: Tokenization, Vocabulary
Last Updated: 2026-02-15 00:00 GMT

Overview

The Vocabulary System is the principle of managing the token vocabulary, special tokens, and tokenizer configuration used by language models.

Description

This principle covers the vocabulary data structure that stores the mapping between token IDs and their text representations, along with tokenizer-specific metadata such as merge rules (for BPE), scores (for SentencePiece), special token definitions (BOS, EOS, padding, unknown), and pre-tokenization rules. The vocabulary is loaded from GGUF model metadata and used by the tokenization and detokenization pipelines.
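As a rough illustration of the data structure described above, the following sketch keeps an ID-to-text table, a reverse lookup, per-token scores, and optional special token IDs. Everything here is hypothetical: `ToyVocab`, `load_toy_vocab`, and the metadata key names (`tokens`, `scores`, `bos_token_id`, and so on) are invented for the example and are not the llama.cpp structures or the actual GGUF key names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyVocab:
    """Hypothetical in-memory vocabulary (illustrative only)."""
    id_to_text: List[str]
    scores: List[float]
    bos_id: Optional[int] = None
    eos_id: Optional[int] = None
    unk_id: Optional[int] = None
    pad_id: Optional[int] = None
    text_to_id: Dict[str, int] = field(init=False, default_factory=dict)

    def __post_init__(self) -> None:
        # Build the reverse map once so lookups work in both directions.
        self.text_to_id = {text: i for i, text in enumerate(self.id_to_text)}

    def token_to_id(self, text: str) -> int:
        # Fall back to the unknown token when the text is not in the vocabulary.
        if text in self.text_to_id:
            return self.text_to_id[text]
        if self.unk_id is None:
            raise KeyError(text)
        return self.unk_id

    def detokenize(self, ids: List[int], skip_special: bool = True) -> str:
        # Control tokens (BOS, EOS, padding) are usually not rendered as text.
        special = {i for i in (self.bos_id, self.eos_id, self.pad_id) if i is not None}
        return "".join(
            self.id_to_text[i] for i in ids if not (skip_special and i in special)
        )


def load_toy_vocab(metadata: dict) -> ToyVocab:
    # `metadata` stands in for key/value data read from a model file;
    # the key names are assumptions made for this sketch.
    tokens = list(metadata["tokens"])
    return ToyVocab(
        id_to_text=tokens,
        scores=list(metadata.get("scores", [0.0] * len(tokens))),
        bos_id=metadata.get("bos_token_id"),
        eos_id=metadata.get("eos_token_id"),
        unk_id=metadata.get("unknown_token_id"),
        pad_id=metadata.get("padding_token_id"),
    )
```

Special token IDs are kept as optional fields because not every model defines every special token, for example a padding token.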

Usage

Apply this principle when implementing tokenizer logic, handling special tokens in prompt construction, or when extending vocabulary support for new tokenizer types.
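One common case of special token handling in prompt construction is deciding whether to add BOS and EOS around an already tokenized prompt. The helper below is a minimal sketch under assumed conventions; `with_special_tokens` and its parameters are hypothetical names, and whether a given model expects BOS or EOS at all depends on that model's tokenizer configuration.

```python
from typing import List, Optional


def with_special_tokens(
    ids: List[int],
    bos_id: Optional[int],
    eos_id: Optional[int],
    add_bos: bool = True,
    add_eos: bool = False,
) -> List[int]:
    """Return a copy of `ids` with BOS/EOS added where requested and defined."""
    out = list(ids)
    # Avoid a doubled BOS when the prompt already starts with one.
    if add_bos and bos_id is not None and (not out or out[0] != bos_id):
        out.insert(0, bos_id)
    # Same guard for EOS at the end of the sequence.
    if add_eos and eos_id is not None and (not out or out[-1] != eos_id):
        out.append(eos_id)
    return out
```

The duplicate check matters in practice: a chat template that already emits the BOS text, combined with a tokenizer that also prepends BOS, would otherwise produce two BOS tokens at the start of the sequence.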

Theoretical Basis

Language model vocabularies define the discrete token set that the model operates over. Different tokenizer algorithms (BPE, SentencePiece, WordPiece) rely on different data structures: BPE requires an ordered list of merge rules that define which adjacent symbol pairs are combined and in what priority, SentencePiece uses per-token scores to rank candidate tokenizations, and WordPiece uses a greedy longest-match lookup in which word-internal pieces are marked with a continuation prefix.

The vocabulary system must also manage special tokens that have semantic meaning to the model (beginning of sequence, end of sequence, padding, unknown token) and ensure that they are handled correctly during both tokenization and generation. The vocabulary header defines the data structures and interface used throughout the codebase to access this information.
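To make the difference between these data structures concrete, the sketch below shows a textbook BPE merge loop driven by a rank table and a classic BERT-style WordPiece lookup using a `##` continuation prefix. Both functions, their names, and the toy tables are assumptions made for illustration; they are not the llama.cpp implementations, which may use different conventions and considerably more efficient algorithms.

```python
from typing import Dict, List, Set, Tuple


def bpe_merge(symbols: List[str], ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Repeatedly merge the adjacent pair with the lowest rank (highest priority)."""
    symbols = list(symbols)
    while len(symbols) > 1:
        best = None  # (rank, index) of the best mergeable pair found so far
        for i in range(len(symbols) - 1):
            rank = ranks.get((symbols[i], symbols[i + 1]))
            if rank is not None and (best is None or rank < best[0]):
                best = (rank, i)
        if best is None:
            break  # no remaining pair appears in the merge table
        i = best[1]
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]", prefix: str = "##") -> List[str]:
    """Greedy longest-match-first split; non-initial pieces carry `prefix`."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces
```

With a rank table of `{("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}`, calling `bpe_merge(list("lower"), ranks)` merges `l` and `o` first, then `lo` and `w`, then `e` and `r`. The BPE path needs an ordered merge table, while the WordPiece path needs only set membership, which is why the vocabulary has to carry tokenizer-specific metadata alongside the token list.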
