
Implementation:Ollama ParseTokenizer

From Leeroopedia
Knowledge Sources
Domains NLP, Format_Conversion
Last Updated 2026-02-14 00:00 GMT

Overview

A concrete tool in the convert package for extracting tokenizer data from HuggingFace model directories.

Description

parseTokenizer reads tokenizer.json, tokenizer_config.json, and special_tokens_map.json to extract the BPE vocabulary, merge rules, special tokens, pre-tokenizer type, and chat template.

parseSentencePiece handles the alternative SentencePiece format by parsing the protobuf-encoded tokenizer.model file.

Both produce a Tokenizer struct containing the vocabulary (tokens, scores, types), merge rules, and special token mappings.
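The BPE path above centers on reading tokenizer.json. As a minimal sketch of what that parsing involves (the struct and field names here are illustrative stand-ins, not the convert package's actual types), a HuggingFace-style tokenizer.json can be decoded with encoding/json to pull out the vocabulary and merge rules:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical mirrors of the data parseTokenizer extracts.
type bpeModel struct {
	Vocab  map[string]int32 `json:"vocab"`  // token -> id
	Merges []string         `json:"merges"` // BPE merge rules, e.g. "he llo"
}

type tokenizerJSON struct {
	Model bpeModel `json:"model"`
}

func main() {
	// A minimal tokenizer.json fragment: vocabulary plus one merge rule.
	raw := []byte(`{
		"model": {
			"vocab": {"<s>": 0, "</s>": 1, "he": 2, "llo": 3, "hello": 4},
			"merges": ["he llo"]
		}
	}`)

	var t tokenizerJSON
	if err := json.Unmarshal(raw, &t); err != nil {
		panic(err)
	}
	fmt.Println(len(t.Model.Vocab), len(t.Model.Merges))
}
```

The real parser additionally reads tokenizer_config.json and special_tokens_map.json for special tokens, the pre-tokenizer type, and the chat template.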

Usage

Called internally during model conversion to extract tokenizer metadata for GGUF embedding.

Code Reference

Source Location

  • Repository: ollama
  • File: convert/tokenizer.go (parseTokenizer), convert/tokenizer_spm.go (parseSentencePiece)
  • Lines: tokenizer.go:L36-235, tokenizer_spm.go:L19-122

Signature

func parseTokenizer(fsys fs.FS, specialTokenTypes []string) (*Tokenizer, error)
func parseSentencePiece(fsys fs.FS) (*Vocabulary, error)

Import

import "github.com/ollama/ollama/convert"

Note that parseTokenizer and parseSentencePiece are unexported (lowercase names), so they cannot be called directly from outside the package; they are reached only through the convert package's exported conversion entry points.

I/O Contract

Inputs (parseTokenizer)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| fsys | fs.FS | Yes | Model directory containing tokenizer.json and its companion config files |
| specialTokenTypes | []string | Yes | Special token types to extract (e.g. "bos", "eos", "unk", "sep", "pad") |

Inputs (parseSentencePiece)

| Name | Type | Required | Description |
|------|------|----------|-------------|
| fsys | fs.FS | Yes | Model directory containing the tokenizer.model protobuf file |
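Both parsers take an fs.FS rooted at the model directory; in the normal conversion flow this would be something like os.DirFS(modelDir). The sketch below uses an in-memory testing/fstest.MapFS to show how the parsers can probe for the files listed above through the fs.FS interface:

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

func main() {
	// In-memory stand-in for a HuggingFace model directory that has the
	// BPE tokenizer files but no SentencePiece tokenizer.model.
	var fsys fs.FS = fstest.MapFS{
		"tokenizer.json":        {Data: []byte(`{}`)},
		"tokenizer_config.json": {Data: []byte(`{}`)},
	}

	// Probe for each tokenizer format the way a parser can: via fs.Stat
	// against the abstract filesystem, never the host filesystem directly.
	for _, name := range []string{"tokenizer.json", "tokenizer.model"} {
		if _, err := fs.Stat(fsys, name); err == nil {
			fmt.Println("found", name)
		} else {
			fmt.Println("missing", name)
		}
	}
}
```

Taking fs.FS rather than a path string keeps the parsers testable and decouples them from the OS filesystem.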

Outputs

| Name | Type | Description |
|------|------|-------------|
| *Tokenizer | *Tokenizer | From parseTokenizer: Vocabulary (tokens, scores, types), Merges, Pre-tokenizer, Template, SpecialVocabulary |
| *Vocabulary | *Vocabulary | From parseSentencePiece: vocabulary parsed from tokenizer.model |
| error | error | Non-nil if tokenizer files are missing or malformed |

Usage Examples

Internal Usage

// From convert/convert.go LoadModelMetadata: try the BPE tokenizer.json
// parser first, then fall back to the SentencePiece format.
t, err := parseTokenizer(fsys, []string{"bos", "eos", "unk", "sep", "pad"})
if err != nil {
    // tokenizer.json missing or unreadable; try tokenizer.model instead
    vocab, err := parseSentencePiece(fsys)
    ...
}
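The fallback order in the snippet above can be sketched as a standalone, runnable program. The parse functions here are hypothetical stubs that only check for the presence of the relevant file; the real implementations parse tokenizer.json and the tokenizer.model protobuf respectively:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// Stub for the BPE path: succeeds only if tokenizer.json exists.
func parseBPE(fsys fs.FS) (string, error) {
	if _, err := fs.Stat(fsys, "tokenizer.json"); err != nil {
		return "", errors.New("tokenizer.json not found")
	}
	return "bpe", nil
}

// Stub for the SentencePiece path: succeeds only if tokenizer.model exists.
func parseSPM(fsys fs.FS) (string, error) {
	if _, err := fs.Stat(fsys, "tokenizer.model"); err != nil {
		return "", errors.New("tokenizer.model not found")
	}
	return "spm", nil
}

// detect mirrors the fallback order: BPE first, then SentencePiece.
func detect(fsys fs.FS) (string, error) {
	if kind, err := parseBPE(fsys); err == nil {
		return kind, nil
	}
	return parseSPM(fsys)
}

func main() {
	// A model directory that ships only a SentencePiece tokenizer.
	spmDir := fstest.MapFS{"tokenizer.model": {Data: []byte{}}}
	kind, err := detect(spmDir)
	fmt.Println(kind, err)
}
```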

Related Pages

Implements Principle

Requires Environment
