Principle: Ollama Tokenizer Extraction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Format_Conversion |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A multi-format tokenizer extraction mechanism that reads vocabulary, merge rules, and special tokens from HuggingFace BPE or SentencePiece formats and converts them to GGUF tokenizer metadata.
Description
Tokenizer Extraction parses tokenizer configuration files from HuggingFace model directories and produces the vocabulary, merge rules, token scores, and special token mappings needed for the GGUF format. It supports two major tokenizer families: BPE (Byte-Pair Encoding, from tokenizer.json) and SentencePiece (from tokenizer.model protobuf files).
The extraction must handle various edge cases: added tokens with special properties, pre-tokenizer types (GPT-2 style, Llama style), chat templates embedded in tokenizer config, and vocabulary padding.
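Vocabulary padding, the last edge case above, arises when the model's declared vocabulary size exceeds the number of tokens the tokenizer actually defines. A minimal sketch of one way to handle it (the `pad_vocab` name and `[PAD{id}]` placeholder format are assumptions, not a fixed convention):

```python
def pad_vocab(tokens: list, target_size: int, pad_format: str = "[PAD{}]") -> list:
    """Append placeholder entries until the token list matches the
    model's declared vocab size; the placeholder format is illustrative."""
    while len(tokens) < target_size:
        tokens.append(pad_format.format(len(tokens)))
    return tokens
```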
Usage
Use this principle when converting models between frameworks that use different tokenizer serialization formats. The extracted tokenizer data becomes part of the GGUF metadata, enabling self-contained model files.
Theoretical Basis
Tokenizer extraction handles two formats:
BPE (tokenizer.json):
- Parse the JSON vocabulary (token → ID mapping)
- Extract merge rules (ordered pairs of subword units)
- Map special tokens (BOS, EOS, UNK, PAD) from special_tokens_map.json
- Extract pre-tokenizer type and chat template from tokenizer_config.json
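The BPE steps above can be sketched as follows. The key names (`model.vocab`, `model.merges`, `added_tokens`, `bos_token`, `chat_template`) follow the HuggingFace tokenizer.json / tokenizer_config.json layout; the `extract_bpe_tokenizer` function name is hypothetical, and for brevity the special tokens are read from tokenizer_config.json here rather than special_tokens_map.json:

```python
def extract_bpe_tokenizer(tokenizer_json: dict, tokenizer_config: dict) -> dict:
    """Pull GGUF-relevant fields from already-parsed HuggingFace JSON files.

    tokenizer_json  : parsed tokenizer.json
    tokenizer_config: parsed tokenizer_config.json
    """
    model = tokenizer_json["model"]
    vocab = model["vocab"]  # token -> id mapping
    # Order tokens by id so list index == token id in the GGUF token list.
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    # Merge rules are an ordered list; newer tokenizer.json files may store
    # each merge as a ["a", "b"] pair instead of an "a b" string.
    merges = [" ".join(m) if isinstance(m, list) else m for m in model["merges"]]
    # Added tokens (specials, user-defined) may extend the base vocabulary.
    for added in tokenizer_json.get("added_tokens", []):
        if added["id"] >= len(tokens):
            tokens.append(added["content"])
    return {
        "tokens": tokens,
        "merges": merges,
        "bos_token": tokenizer_config.get("bos_token"),
        "eos_token": tokenizer_config.get("eos_token"),
        "chat_template": tokenizer_config.get("chat_template"),
    }
```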
SentencePiece (tokenizer.model):
- Parse the protobuf-encoded model file
- Extract tokens with scores (log probabilities)
- Map token types (normal, unknown, control, byte)
- Handle byte fallback tokens
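A minimal sketch of the SentencePiece side, walking the protobuf wire format directly rather than depending on generated proto classes. It decodes only the repeated `pieces` field (field 1 of ModelProto), whose submessage carries `piece` (string, field 1), `score` (float, field 2), and `type` (enum, field 3); the function names are hypothetical:

```python
import struct

# SentencePiece piece-type enum values from sentencepiece_model.proto.
PIECE_TYPES = {1: "normal", 2: "unknown", 3: "control",
               4: "user_defined", 5: "unused", 6: "byte"}

def _read_varint(buf: bytes, pos: int):
    """Decode a protobuf varint, returning (value, next_position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def _parse_piece(buf: bytes):
    """Decode one SentencePiece submessage into (piece, score, type_name)."""
    piece, score, ptype = "", 0.0, 1  # type defaults to NORMAL
    pos = 0
    while pos < len(buf):
        tag, pos = _read_varint(buf, pos)
        field, wire = tag >> 3, tag & 7
        if field == 1 and wire == 2:      # piece: length-delimited string
            length, pos = _read_varint(buf, pos)
            piece = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif field == 2 and wire == 5:    # score: 32-bit float
            score = struct.unpack("<f", buf[pos:pos + 4])[0]
            pos += 4
        elif field == 3 and wire == 0:    # type: varint enum
            ptype, pos = _read_varint(buf, pos)
        elif wire == 2:                   # skip unknown length-delimited field
            length, pos = _read_varint(buf, pos)
            pos += length
        elif wire == 0:                   # skip unknown varint field
            _, pos = _read_varint(buf, pos)
        elif wire == 5:
            pos += 4
        elif wire == 1:
            pos += 8
    return piece, score, PIECE_TYPES.get(ptype, "normal")

def parse_sentencepiece_model(data: bytes):
    """Yield (piece, score, type_name) tuples from a serialized ModelProto,
    skipping every field other than the repeated `pieces` field."""
    pieces, pos = [], 0
    while pos < len(data):
        tag, pos = _read_varint(data, pos)
        field, wire = tag >> 3, tag & 7
        if wire == 2:
            length, pos = _read_varint(data, pos)
            payload = data[pos:pos + length]
            pos += length
            if field == 1:                # SentencePiece submessage
                pieces.append(_parse_piece(payload))
        elif wire == 0:
            _, pos = _read_varint(data, pos)
        elif wire == 5:
            pos += 4
        elif wire == 1:
            pos += 8
        else:
            raise ValueError(f"unsupported wire type {wire}")
    return pieces
```

In practice the official `sentencepiece` Python bindings (or the generated proto classes) are the safer route; the hand-rolled walk above is only meant to make the on-disk structure concrete.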