Principle: Ollama Tokenizer Extraction
| Knowledge Sources | |
|---|---|
| Domains | NLP, Format_Conversion |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A multi-format tokenizer extraction mechanism that reads vocabulary, merge rules, and special tokens from HuggingFace BPE or SentencePiece formats and converts them to GGUF tokenizer metadata.
Description
Tokenizer Extraction parses tokenizer configuration files from HuggingFace model directories and produces the vocabulary, merge rules, token scores, and special token mappings needed for the GGUF format. It supports two major tokenizer families: BPE (Byte-Pair Encoding, from tokenizer.json) and SentencePiece (from tokenizer.model protobuf files).
The extraction must handle various edge cases: added tokens with special properties, pre-tokenizer types (GPT-2 style, Llama style), chat templates embedded in tokenizer config, and vocabulary padding.
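Vocabulary padding, the last edge case above, arises when the model's declared vocabulary size exceeds the number of tokens the tokenizer actually defines. A minimal sketch of one way to handle it (the `pad_vocab` name and `[PAD{id}]` placeholder format are assumptions, not a fixed convention):

```python
def pad_vocab(tokens: list, target_size: int, pad_format: str = "[PAD{}]") -> list:
    """Append placeholder entries until the token list matches the
    model's declared vocab size; the placeholder format is illustrative."""
    while len(tokens) < target_size:
        tokens.append(pad_format.format(len(tokens)))
    return tokens
```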
Usage
Use this principle when converting models between frameworks that use different tokenizer serialization formats. The extracted tokenizer data becomes part of the GGUF metadata, enabling self-contained model files.
Theoretical Basis
Tokenizer extraction handles two formats:
BPE (tokenizer.json):
- Parse the JSON vocabulary (token → ID mapping)
- Extract merge rules (ordered pairs of subword units)
- Map special tokens (BOS, EOS, UNK, PAD) from special_tokens_map.json
- Extract pre-tokenizer type and chat template from tokenizer_config.json
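The BPE steps above can be sketched as follows. The key names (`model.vocab`, `model.merges`, `added_tokens`, `bos_token`, `chat_template`) follow the HuggingFace tokenizer.json / tokenizer_config.json layout; the `extract_bpe_tokenizer` function name is hypothetical, and for brevity the special tokens are read from tokenizer_config.json here rather than special_tokens_map.json:

```python
def extract_bpe_tokenizer(tokenizer_json: dict, tokenizer_config: dict) -> dict:
    """Pull GGUF-relevant fields from already-parsed HuggingFace JSON files.

    tokenizer_json  : parsed tokenizer.json
    tokenizer_config: parsed tokenizer_config.json
    """
    model = tokenizer_json["model"]
    vocab = model["vocab"]  # token -> id mapping
    # Order tokens by id so list index == token id in the GGUF token list.
    tokens = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    # Merge rules are an ordered list; newer tokenizer.json files may store
    # each merge as a ["a", "b"] pair instead of an "a b" string.
    merges = [" ".join(m) if isinstance(m, list) else m for m in model["merges"]]
    # Added tokens (specials, user-defined) may extend the base vocabulary.
    for added in tokenizer_json.get("added_tokens", []):
        if added["id"] >= len(tokens):
            tokens.append(added["content"])
    return {
        "tokens": tokens,
        "merges": merges,
        "bos_token": tokenizer_config.get("bos_token"),
        "eos_token": tokenizer_config.get("eos_token"),
        "chat_template": tokenizer_config.get("chat_template"),
    }
```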
SentencePiece (tokenizer.model):
- Parse the protobuf-encoded model file
- Extract tokens with scores (log probabilities)
- Map token types (normal, unknown, control, byte)
- Handle byte fallback tokens
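A minimal sketch of the SentencePiece side, walking the protobuf wire format directly rather than depending on generated proto classes. It decodes only the repeated `pieces` field (field 1 of ModelProto), whose submessage carries `piece` (string, field 1), `score` (float, field 2), and `type` (enum, field 3); the function names are hypothetical:

```python
import struct

# SentencePiece piece-type enum values from sentencepiece_model.proto.
PIECE_TYPES = {1: "normal", 2: "unknown", 3: "control",
               4: "user_defined", 5: "unused", 6: "byte"}

def _read_varint(buf: bytes, pos: int):
    """Decode a protobuf varint, returning (value, next_position)."""
    result = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def _parse_piece(buf: bytes):
    """Decode one SentencePiece submessage into (piece, score, type_name)."""
    piece, score, ptype = "", 0.0, 1  # type defaults to NORMAL
    pos = 0
    while pos < len(buf):
        tag, pos = _read_varint(buf, pos)
        field, wire = tag >> 3, tag & 7
        if field == 1 and wire == 2:      # piece: length-delimited string
            length, pos = _read_varint(buf, pos)
            piece = buf[pos:pos + length].decode("utf-8")
            pos += length
        elif field == 2 and wire == 5:    # score: 32-bit float
            score = struct.unpack("<f", buf[pos:pos + 4])[0]
            pos += 4
        elif field == 3 and wire == 0:    # type: varint enum
            ptype, pos = _read_varint(buf, pos)
        elif wire == 2:                   # skip unknown length-delimited field
            length, pos = _read_varint(buf, pos)
            pos += length
        elif wire == 0:                   # skip unknown varint field
            _, pos = _read_varint(buf, pos)
        elif wire == 5:
            pos += 4
        elif wire == 1:
            pos += 8
    return piece, score, PIECE_TYPES.get(ptype, "normal")

def parse_sentencepiece_model(data: bytes):
    """Yield (piece, score, type_name) tuples from a serialized ModelProto,
    skipping every field other than the repeated `pieces` field."""
    pieces, pos = [], 0
    while pos < len(data):
        tag, pos = _read_varint(data, pos)
        field, wire = tag >> 3, tag & 7
        if wire == 2:
            length, pos = _read_varint(data, pos)
            payload = data[pos:pos + length]
            pos += length
            if field == 1:                # SentencePiece submessage
                pieces.append(_parse_piece(payload))
        elif wire == 0:
            _, pos = _read_varint(data, pos)
        elif wire == 5:
            pos += 4
        elif wire == 1:
            pos += 8
        else:
            raise ValueError(f"unsupported wire type {wire}")
    return pieces
```

In practice the official `sentencepiece` Python bindings (or the generated proto classes) are the safer route; the hand-rolled walk above is only meant to make the on-disk structure concrete.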