Principle:Ggml_org_Llama_cpp_Prompt_Tokenization
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | Tokenization, BPE, SentencePiece, WordPiece, Subword Encoding | 2026-02-14 |
Overview
Description
Prompt Tokenization is the step in the llama.cpp text generation pipeline that converts a raw text string into a sequence of integer token IDs that the model can process. Language models do not operate on characters or words directly; they operate on tokens -- subword units drawn from a fixed vocabulary learned during model training. Tokenization is the bridge between human-readable text and the model's numeric input representation.
The tokenization process must faithfully reproduce the exact encoding scheme used during model training. Using a different tokenizer or different settings would produce mismatched token IDs, leading to nonsensical model outputs. This is why the tokenizer vocabulary and merge rules are stored within the GGUF model file itself and loaded alongside the model weights.
Usage
Tokenization is performed after model loading (to obtain the vocabulary handle) and before batch decoding. It is also used in the reverse direction (token-to-text) for rendering generated tokens back to human-readable output.
```c
const llama_vocab * vocab = llama_model_get_vocab(model);

const char * text = "Hello, world!";
int text_len = strlen(text);

// First call: with a NULL buffer, llama_tokenize returns the negated
// number of tokens required, so negate it to get the count
int n_tokens = -llama_tokenize(vocab, text, text_len, NULL, 0, true, true);

// Second call: perform the actual tokenization into the allocated buffer
llama_token * tokens = malloc(n_tokens * sizeof(llama_token));
llama_tokenize(vocab, text, text_len, tokens, n_tokens, true, true);
```
Theoretical Basis
Why Subword Tokenization?
Character-level tokenization produces very long sequences (one token per character), which makes transformer self-attention prohibitively expensive. Word-level tokenization produces shorter sequences but cannot handle out-of-vocabulary words, misspellings, or morphological variations.
Subword tokenization strikes a balance: common words are represented as single tokens, while rare or unseen words are decomposed into smaller subword pieces. For example, the word "tokenization" might be split into ["token", "ization"], while the common word "the" is a single token. This gives models a fixed-size vocabulary (typically 32K-128K tokens) that can represent any text.
Byte Pair Encoding (BPE)
BPE is the most common tokenization algorithm used by modern LLMs (the GPT family, later Llama models, Mistral, and others). The algorithm works as follows:
Training phase (performed once, results stored in the model):
- Start with a vocabulary of individual bytes (or characters)
- Count all adjacent token pairs in the training corpus
- Merge the most frequent pair into a new token
- Repeat the pair-counting and merging steps until the desired vocabulary size is reached
Encoding phase (performed at inference time):
- Split the input text into individual bytes/characters
- Repeatedly find the highest-priority merge pair in the current sequence and apply it
- Continue until no more merges can be applied
- The resulting sequence of tokens is the encoded output
The merge rules (which pairs to merge and in what order) are stored in the model's vocabulary data. The priority ordering ensures deterministic encoding.
SentencePiece
SentencePiece is an alternative tokenization framework used by some models (original LLaMA, multilingual models). It has several key differences from standard BPE:
- Treats whitespace as a regular character -- spaces are encoded as a special Unicode character (U+2581, the "lower one eighth block") and included in tokens, rather than being used as word boundaries
- Unigram model option -- besides BPE, SentencePiece supports a unigram language model approach where the most probable segmentation of the input is found using the Viterbi algorithm
- Byte fallback -- unknown characters can be represented as byte tokens, ensuring any input can be encoded
WordPiece
WordPiece (used by BERT-family models) is similar to BPE but differs in the merge criterion: instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data under a unigram language model. In practice, this means it greedily selects the longest matching token from left to right, prefixing continuation tokens with "##".
Special Tokens
Language models define special tokens with semantic meaning beyond regular text:
- BOS (Beginning of Sequence) -- marks the start of an input sequence
- EOS (End of Sequence) -- marks the end of a generated sequence
- EOT (End of Turn) -- marks turn boundaries in chat models
- PAD (Padding) -- used to pad sequences to equal length in batches
The add_special parameter controls whether BOS/EOS tokens are automatically added during tokenization. The parse_special parameter controls whether special token text representations (like "<|endoftext|>") in the input string are recognized as their corresponding token IDs rather than being tokenized as regular text.
The Two-Pass Pattern
llama.cpp's tokenize function supports a common two-pass usage pattern:
- First pass -- call with `n_tokens_max = 0` and `tokens = NULL`. The function returns a negative number whose absolute value is the number of tokens needed.
- Second pass -- allocate a buffer of the required size and call again to perform the actual tokenization.
This avoids the need to pre-allocate a conservatively large buffer and handles variable-length encodings correctly.
Related Pages
- Implementation:Ggml_org_Llama_cpp_Llama_Tokenize
- Principle:Ggml_org_Llama_cpp_GGUF_Model_Loading -- vocabulary is loaded as part of the model
- Principle:Ggml_org_Llama_cpp_Batch_Decoding -- tokenized output is fed into batch decoding