
Implementation: ggml-org/llama.cpp Tokenize Tool

From Leeroopedia
Domains: Tokenization
Last Updated: 2026-02-15 00:00 GMT

Overview

A CLI tool that tokenizes text input using a model's tokenizer and displays the resulting tokens alongside their IDs.

Description

The tool loads a GGUF model to access its tokenizer and reads prompt text from a CLI argument, a file, or stdin. It tokenizes the input using `llama_tokenize` with configurable options (BOS token, escape sequences, special-token parsing), then outputs either human-readable token strings with their IDs or just the numerical token IDs in a Python-parseable format such as `[1, 2, 3]`. On Windows it retrieves command-line arguments as UTF-8 via `CommandLineToArgvW`, and `--show-count` additionally prints the total token count.

Usage

Use this tool for debugging tokenizer behavior, for prompt engineering, for tokenizer validation, and for understanding where a model's tokenizer places token boundaries in a piece of text.

Code Reference

Source Location

Signature

// Main entry point
int main(int argc, char ** argv);

// Utility functions
static void print_usage_information(const char * argv0);
static std::string read_prompt_from_file(const char * filepath, bool & success);
static std::vector<std::string> ingest_args(int raw_argc, char ** raw_argv);

Import

#include "common.h"
#include "llama.h"
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

I/O Contract

Inputs

Name Type Required Description
-m, --model string Yes Path to the GGUF model file (used for its tokenizer)
-p, --prompt string No Text to tokenize (from CLI argument)
-f, --file string No Path to a file containing text to tokenize
--stdin flag No Read text to tokenize from standard input
--ids flag No Output only numerical token IDs in Python list format
--no-bos flag No Do not prepend BOS token
--no-escape flag No Do not process escape sequences (\n, \t, etc.)
--no-parse-special flag No Do not parse special/control tokens
--show-count flag No Print total token count
--log-disable flag No Suppress model loading log output

Outputs

Name Type Description
token output stdout Token strings with IDs, or numerical IDs in Python list format
token count stdout Total number of tokens (when --show-count is used)
return code int 0 on success, 1 on error

Usage Examples

# Basic tokenization with human-readable output
./tokenize -m model.gguf -p "Hello, world!"

# Output only token IDs in Python format
./tokenize -m model.gguf -p "Hello, world!" --ids

# Tokenize from file, show count
./tokenize -m model.gguf -f input.txt --show-count

# Read from stdin
echo "Hello world" | ./tokenize -m model.gguf --stdin
