Implementation: llama_tokenize (ggml-org/llama.cpp)

From Leeroopedia
Knowledge Sources: ggml-org/llama.cpp
Domains: Text Tokenization, BPE Encoding, Token ID Conversion
Last Updated: 2026-02-14

Overview

Description

llama_tokenize converts a text string into a sequence of integer token IDs using the vocabulary and tokenization rules embedded in the model. The function supports BPE, SentencePiece, and WordPiece tokenization schemes depending on the model's vocabulary type. It handles special token insertion (BOS/EOS) and can optionally parse special token text representations in the input string.

The function follows a two-pass pattern: calling with a zero-size output buffer returns the negated count of required tokens, allowing the caller to allocate the exact buffer size needed before performing the actual tokenization.

Usage

#include "llama.h"

const llama_vocab * vocab = llama_model_get_vocab(model);
const char * prompt = "Hello my name is";

// Pass 1: determine token count
const int n_prompt = -llama_tokenize(vocab, prompt, strlen(prompt), NULL, 0, true, true);

// Pass 2: perform tokenization
std::vector<llama_token> tokens(n_prompt);
if (llama_tokenize(vocab, prompt, strlen(prompt), tokens.data(), tokens.size(), true, true) < 0) {
    fprintf(stderr, "Tokenization failed\n");
    return 1;
}

Code Reference

Source Location

File Line(s) Type
include/llama.h 1102-1109 Declaration
src/llama-vocab.cpp 3908-3917 Implementation

Signature

LLAMA_API int32_t llama_tokenize(
    const struct llama_vocab * vocab,
                  const char * text,
                     int32_t   text_len,
                 llama_token * tokens,
                     int32_t   n_tokens_max,
                        bool   add_special,
                        bool   parse_special);

Import

#include "llama.h"

I/O Contract

Inputs

Parameter Type Description
vocab const struct llama_vocab * Vocabulary handle obtained from llama_model_get_vocab(model). Contains the tokenizer rules and vocabulary table.
text const char * Input text string to tokenize. Does not need to be null-terminated; length is specified by text_len.
text_len int32_t Length of the input text in bytes.
tokens llama_token * Output buffer for token IDs. Can be NULL when querying the required token count (set n_tokens_max to 0).
n_tokens_max int32_t Maximum number of tokens that the output buffer can hold. Set to 0 with tokens = NULL to query the required count.
add_special bool If true, automatically add BOS and EOS tokens if the model is configured to include them.
parse_special bool If true, parse text representations of special tokens (e.g. "<|endoftext|>") as their corresponding token IDs. If false, such text is tokenized as regular plaintext. Does not insert a leading space.

Outputs

Return Type Description
token count int32_t On success: the number of tokens written to the output buffer (no more than n_tokens_max).
negative count int32_t On insufficient buffer: a negative number whose absolute value is the number of tokens that would have been produced. Use this to allocate the correct buffer size.
INT32_MIN int32_t On overflow: the tokenization result size exceeds the int32_t limit.

Usage Examples

Two-Pass Tokenization (from examples/simple/simple.cpp)

const llama_vocab * vocab = llama_model_get_vocab(model);

std::string prompt = "Hello my name is";

// First pass: find the number of tokens in the prompt
const int n_prompt = -llama_tokenize(vocab, prompt.c_str(), prompt.size(), NULL, 0, true, true);

// Second pass: allocate space and tokenize
std::vector<llama_token> prompt_tokens(n_prompt);
if (llama_tokenize(vocab, prompt.c_str(), prompt.size(),
                   prompt_tokens.data(), prompt_tokens.size(),
                   true, true) < 0) {
    fprintf(stderr, "error: failed to tokenize the prompt\n");
    return 1;
}

Tokenization Without Special Tokens

const char * text = "Hello world";

// Do not add BOS/EOS, do not parse special tokens
const int n = -llama_tokenize(vocab, text, strlen(text), NULL, 0, false, false);
llama_token tokens[128];
if (n > 128 || llama_tokenize(vocab, text, strlen(text), tokens, 128, false, false) < 0) {
    fprintf(stderr, "Tokenization failed\n");
}

Converting Tokens Back to Text

After generation, use llama_token_to_piece to convert individual tokens back to text:

for (auto id : prompt_tokens) {
    char buf[128];
    int n = llama_token_to_piece(vocab, id, buf, sizeof(buf), 0, true);
    if (n < 0) {
        fprintf(stderr, "error: failed to convert token to piece\n");
        return 1;
    }
    std::string s(buf, n);
    printf("%s", s.c_str());
}
