
Principle:Ollama Tokenizer Extraction

From Leeroopedia
Domains NLP, Format_Conversion
Last Updated 2026-02-14 00:00 GMT

Overview

A multi-format tokenizer extraction mechanism that reads vocabulary, merge rules, and special tokens from HuggingFace BPE or SentencePiece formats and converts them to GGUF tokenizer metadata.

Description

Tokenizer Extraction parses tokenizer configuration files from HuggingFace model directories and produces the vocabulary, merge rules, token scores, and special token mappings needed for the GGUF format. It supports two major tokenizer families: BPE (Byte-Pair Encoding, from tokenizer.json) and SentencePiece (from tokenizer.model protobuf files).

The extraction must handle various edge cases: added tokens with special properties, pre-tokenizer types (GPT-2 style, Llama style), chat templates embedded in tokenizer config, and vocabulary padding.
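Vocabulary padding, for instance, fills the gap between the number of tokens actually present in the tokenizer files and the vocab size the model checkpoint declares, so that token IDs stay aligned. A minimal sketch; the `[PAD<id>]` naming is an illustrative convention, not something mandated by GGUF:

```python
def pad_vocab(tokens: list[str], target_size: int) -> list[str]:
    """Pad an extracted token list up to the model's declared vocab size.

    Some checkpoints declare a vocab size larger than the token count in
    tokenizer.json; the gap is filled with unique placeholder entries so
    embedding rows and token IDs stay in one-to-one correspondence.
    """
    if len(tokens) > target_size:
        raise ValueError("extracted vocab exceeds declared vocab size")
    return tokens + [f"[PAD{i}]" for i in range(len(tokens), target_size)]
```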

Usage

Use this principle when converting models between frameworks that use different tokenizer serialization formats. The extracted tokenizer data becomes part of the GGUF metadata, enabling self-contained model files.

Theoretical Basis

Tokenizer extraction handles two formats:

BPE (tokenizer.json):

  1. Parse the JSON vocabulary (token → ID mapping)
  2. Extract merge rules (ordered pairs of subword units)
  3. Map special tokens (BOS, EOS, UNK, PAD) from special_tokens_map.json
  4. Extract pre-tokenizer type and chat template from tokenizer_config.json

SentencePiece (tokenizer.model):

  1. Parse the protobuf-encoded model file
  2. Extract tokens with scores (log probabilities)
  3. Map token types (normal, unknown, control, byte)
  4. Handle byte fallback tokens
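Steps 3 and 4 of the SentencePiece path can be sketched as a pure function over already-parsed pieces (in practice the `(piece, score, type)` entries come from the protobuf model file). The numeric type codes below mirror the sentencepiece `ModelProto` type enum, which GGUF's token-type convention happens to match; treat both the codes and the function name as illustrative assumptions:

```python
# Illustrative constants mirroring the sentencepiece piece-type enum.
SP_NORMAL, SP_UNKNOWN, SP_CONTROL, SP_BYTE = 1, 2, 3, 6

def classify_pieces(pieces):
    """Map parsed SentencePiece entries (piece, score, sp_type) to
    (token_text, score, token_type) triples for GGUF metadata.

    Byte-fallback tokens such as "<0x41>" stand for a single raw byte;
    they are decoded here so the byte value is recoverable downstream.
    """
    out = []
    for piece, score, sp_type in pieces:
        text = piece
        if sp_type == SP_BYTE:
            # "<0xNN>" -> the literal byte 0xNN (latin-1 keeps it 1:1)
            text = bytes([int(piece[3:5], 16)]).decode("latin-1")
        out.append((text, score, sp_type))
    return out
```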
