Principle:Ggml_org_Llama_cpp_Prompt_Tokenization
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Tokenization, BPE, SentencePiece, WordPiece, Subword Encoding | 2026-02-14 |
Overview
Description
Prompt Tokenization is the step in the llama.cpp text generation pipeline that converts a raw text string into a sequence of integer token IDs that the model can process. Language models do not operate on characters or words directly; they operate on tokens -- subword units drawn from a fixed vocabulary learned during model training. Tokenization is the bridge between human-readable text and the model's numeric input representation.
The tokenization process must faithfully reproduce the exact encoding scheme used during model training. Using a different tokenizer or different settings would produce mismatched token IDs, leading to nonsensical model outputs. This is why the tokenizer vocabulary and merge rules are stored within the GGUF model file itself and loaded alongside the model weights.
Usage
Tokenization is performed after model loading (to obtain the vocabulary handle) and before batch decoding. It is also used in the reverse direction (token-to-text) for rendering generated tokens back to human-readable output.
```c
const llama_vocab * vocab = llama_model_get_vocab(model);

const char * text = "Hello, world!";
int text_len = strlen(text);

// First call: with a NULL buffer, llama_tokenize returns the negated
// number of tokens required, so negate it to get the count
int n_tokens = -llama_tokenize(vocab, text, text_len, NULL, 0, true, true);

// Second call: perform the actual tokenization into the allocated buffer
llama_token * tokens = malloc(n_tokens * sizeof(llama_token));
llama_tokenize(vocab, text, text_len, tokens, n_tokens, true, true);
```
Theoretical Basis
Why Subword Tokenization?
Character-level tokenization produces very long sequences (one token per character), which makes transformer self-attention prohibitively expensive. Word-level tokenization produces shorter sequences but cannot handle out-of-vocabulary words, misspellings, or morphological variations.
Subword tokenization strikes a balance: common words are represented as single tokens, while rare or unseen words are decomposed into smaller subword pieces. For example, the word "tokenization" might be split into ["token", "ization"], while the common word "the" is a single token. This gives models a fixed-size vocabulary (typically 32K-128K tokens) that can represent any text.
Byte Pair Encoding (BPE)
BPE is the most common tokenization algorithm used by modern LLMs (the GPT family, later Llama models, Mistral, and others). The algorithm works as follows:
Training phase (performed once, results stored in the model):
- Start with a vocabulary of individual bytes (or characters)
- Count all adjacent token pairs in the training corpus
- Merge the most frequent pair into a new token
- Repeat the pair-counting and merging steps until the desired vocabulary size is reached
Encoding phase (performed at inference time):
- Split the input text into individual bytes/characters
- Repeatedly find the highest-priority merge pair in the current sequence and apply it
- Continue until no more merges can be applied
- The resulting sequence of tokens is the encoded output
The merge rules (which pairs to merge and in what order) are stored in the model's vocabulary data. The priority ordering ensures deterministic encoding.
SentencePiece
SentencePiece is an alternative tokenization framework used by some models (original LLaMA, multilingual models). It has several key differences from standard BPE:
- Treats whitespace as a regular character -- spaces are encoded as a special Unicode character (U+2581, the "lower one eighth block") and included in tokens, rather than being used as word boundaries
- Unigram model option -- besides BPE, SentencePiece supports a unigram language model approach where the most probable segmentation of the input is found using the Viterbi algorithm
- Byte fallback -- unknown characters can be represented as byte tokens, ensuring any input can be encoded
WordPiece
WordPiece (used by BERT-family models) is similar to BPE but differs in the merge criterion: instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data under a unigram language model. In practice, this means it greedily selects the longest matching token from left to right, prefixing continuation tokens with "##".
Special Tokens
Language models define special tokens with semantic meaning beyond regular text:
- BOS (Beginning of Sequence) -- marks the start of an input sequence
- EOS (End of Sequence) -- marks the end of a generated sequence
- EOT (End of Turn) -- marks turn boundaries in chat models
- PAD (Padding) -- used to pad sequences to equal length in batches
The add_special parameter controls whether BOS/EOS tokens are automatically added during tokenization. The parse_special parameter controls whether special token text representations (like "<|endoftext|>") in the input string are recognized as their corresponding token IDs rather than being tokenized as regular text.
The Two-Pass Pattern
llama.cpp's tokenize function supports a common two-pass usage pattern:
- First pass -- call with `n_tokens_max = 0` and `tokens = NULL`. The function returns a negative number whose absolute value is the number of tokens needed.
- Second pass -- allocate a buffer of the required size and call again to perform the actual tokenization.
This avoids the need to pre-allocate a conservatively large buffer and handles variable-length encodings correctly.
Related Pages
- Implementation:Ggml_org_Llama_cpp_Llama_Tokenize
- Principle:Ggml_org_Llama_cpp_GGUF_Model_Loading -- vocabulary is loaded as part of the model
- Principle:Ggml_org_Llama_cpp_Batch_Decoding -- tokenized output is fed into batch decoding