Implementation: ggml-org/llama.cpp Tokenize Tool
| Knowledge Sources | |
|---|---|
| Domains | Tokenization |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
CLI tool that tokenizes text input using a model's tokenizer and displays the resulting tokens with their IDs.
Description
The tool loads a GGUF model to access its tokenizer and reads prompt text from a CLI argument, a file, or stdin. It tokenizes the input using `llama_tokenize` with configurable options (BOS token, escape sequences, special token parsing) and outputs either human-readable token strings with IDs or just the numerical token IDs in a Python-parseable format such as `[1, 2, 3]`. It handles Windows-specific UTF-8 argument encoding via `CommandLineToArgvW` and supports `--show-count` to display the total token count.
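The `--ids` output format described above can be sketched as follows. This is an illustrative stand-in, not the verbatim source: the real tool prints directly with `printf` while iterating the token vector, and `format_ids` is a hypothetical helper introduced here for clarity.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: render token IDs as a Python-parseable list,
// e.g. {1, 2, 3} -> "[1, 2, 3]", matching the --ids output shape.
static std::string format_ids(const std::vector<int> & tokens) {
    std::string out = "[";
    for (size_t i = 0; i < tokens.size(); ++i) {
        if (i > 0) {
            out += ", ";
        }
        out += std::to_string(tokens[i]);
    }
    out += "]";
    return out;
}
```

A caller would simply `printf("%s\n", format_ids(tokens).c_str());` after tokenizing.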
Usage
Use this tool for debugging tokenizer behavior, prompt engineering, tokenizer validation, and understanding how a model's tokenizer breaks down text into token boundaries.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: tools/tokenize/tokenize.cpp
- Lines: 1-416
Signature
// Main entry point
int main(int argc, char ** argv);
// Utility functions
static void print_usage_information(const char * argv0);
static std::string read_prompt_from_file(const char * filepath, bool & success);
static std::vector<std::string> ingest_args(int raw_argc, char ** raw_argv);
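A plausible implementation of `read_prompt_from_file` is a whole-file slurp with an out-parameter success flag, matching the signature above. This is a sketch under that assumption, not the verbatim source; the actual error handling and messages may differ.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file into a string; sets `success` to false on any
// open failure and returns an empty string in that case.
static std::string read_prompt_from_file(const char * filepath, bool & success) {
    success = false;
    std::ifstream in(filepath, std::ios::binary);
    if (!in) {
        return "";
    }
    std::ostringstream ss;
    ss << in.rdbuf();   // slurp the whole stream
    success = true;
    return ss.str();
}
```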
Import
#include "common.h"
#include "llama.h"
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m, --model | string | Yes | Path to the GGUF model file (used for its tokenizer) |
| -p, --prompt | string | No | Text to tokenize (from CLI argument) |
| -f, --file | string | No | Path to a file containing text to tokenize |
| --stdin | flag | No | Read text to tokenize from standard input |
| --ids | flag | No | Output only numerical token IDs in Python list format |
| --no-bos | flag | No | Do not prepend BOS token |
| --no-escape | flag | No | Do not process escape sequences (\\n, \\t, etc.) |
| --no-parse-special | flag | No | Do not parse special/control tokens |
| --show-count | flag | No | Print total token count |
| --log-disable | flag | No | Suppress model loading log output |
Outputs
| Name | Type | Description |
|---|---|---|
| token output | stdout | Token strings with IDs, or numerical IDs in Python list format |
| token count | stdout | Total number of tokens (when --show-count is used) |
| return code | int | 0 on success, 1 on error |
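The escape processing that `--no-escape` disables turns literal two-character sequences like `\n` and `\t` in the prompt into real control characters before tokenization. The sketch below is a simplified stand-in handling only `\n`, `\t`, and `\\` (llama.cpp's common code has its own, more complete escape helper; this is not that function).

```cpp
#include <string>

// Simplified escape processing: convert literal "\n", "\t", "\\"
// sequences into the corresponding characters. Unknown escapes are
// passed through unchanged.
static std::string process_escapes(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[++i]) {
                case 'n':  out += '\n'; break;
                case 't':  out += '\t'; break;
                case '\\': out += '\\'; break;
                default:   out += '\\'; out += in[i]; break;
            }
        } else {
            out += in[i];
        }
    }
    return out;
}
```

With `--no-escape`, a prompt like `"a\nb"` is tokenized with the backslash and `n` as ordinary characters instead of a newline.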
Usage Examples
# Basic tokenization with human-readable output
./tokenize -m model.gguf -p "Hello, world!"
# Output only token IDs in Python format
./tokenize -m model.gguf -p "Hello, world!" --ids
# Tokenize from file, show count
./tokenize -m model.gguf -f input.txt --show-count
# Read from stdin
echo "Hello world" | ./tokenize -m model.gguf --stdin