Implementation:Ollama Ollama ParseTokenizer
| Knowledge Sources | |
|---|---|
| Domains | NLP, Format_Conversion |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
A concrete tool, provided by the convert package, for extracting tokenizer data from HuggingFace model directories.
Description
parseTokenizer reads tokenizer.json, tokenizer_config.json, and special_tokens_map.json to extract the BPE vocabulary, merge rules, special tokens, pre-tokenizer type, and chat template.
parseSentencePiece handles the alternative SentencePiece format by parsing the protobuf-encoded tokenizer.model file.
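parseTokenizer returns a Tokenizer struct containing the vocabulary (tokens, scores, types), merge rules, and special token mappings. parseSentencePiece returns only the Vocabulary (tokens, scores, types), as its signature indicates.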
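The BPE extraction described above can be illustrated with a minimal sketch. It is not the convert package's code: `readBPE`, `bpeFile`, and `demoFS` are hypothetical names, and the sketch assumes the common HuggingFace layout in which `tokenizer.json` keeps the vocabulary under `model.vocab` (token to id) and the merge rules under `model.merges` as space-separated strings. Files that store merges as pairs would need a different field type.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"sort"
	"testing/fstest"
)

// bpeFile mirrors only the parts of tokenizer.json this sketch reads.
// The field layout is an assumption about the HuggingFace format.
type bpeFile struct {
	Model struct {
		Vocab  map[string]int `json:"vocab"`
		Merges []string       `json:"merges"`
	} `json:"model"`
}

// readBPE is a hypothetical helper: it returns the tokens ordered by id
// together with the merge rules found in tokenizer.json.
func readBPE(fsys fs.FS) ([]string, []string, error) {
	data, err := fs.ReadFile(fsys, "tokenizer.json")
	if err != nil {
		return nil, nil, err
	}
	var f bpeFile
	if err := json.Unmarshal(data, &f); err != nil {
		return nil, nil, err
	}
	tokens := make([]string, 0, len(f.Model.Vocab))
	for tok := range f.Model.Vocab {
		tokens = append(tokens, tok)
	}
	// Map iteration order is random, so sort tokens by their numeric id.
	sort.Slice(tokens, func(i, j int) bool {
		return f.Model.Vocab[tokens[i]] < f.Model.Vocab[tokens[j]]
	})
	return tokens, f.Model.Merges, nil
}

// demoFS builds a tiny in-memory model directory for the example.
func demoFS() fs.FS {
	return fstest.MapFS{
		"tokenizer.json": &fstest.MapFile{Data: []byte(
			`{"model":{"type":"BPE","vocab":{"b":1,"a":0,"ab":2},"merges":["a b"]}}`,
		)},
	}
}

func main() {
	tokens, merges, err := readBPE(demoFS())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("tokens:", tokens)
	fmt.Println("merges:", merges)
}
```

Taking an `fs.FS` rather than a path, as the real signatures do, lets the same code run against a directory on disk (`os.DirFS`) or an in-memory fixture like the one here.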
Usage
Called internally during model conversion to extract the tokenizer metadata that is embedded in the resulting GGUF file.
Code Reference
Source Location
- Repository: ollama
- File: convert/tokenizer.go (parseTokenizer), convert/tokenizer_spm.go (parseSentencePiece)
- Lines: tokenizer.go:L36-235, tokenizer_spm.go:L19-122
Signature
func parseTokenizer(fsys fs.FS, specialTokenTypes []string) (*Tokenizer, error)
func parseSentencePiece(fsys fs.FS) (*Vocabulary, error)
Import
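import "github.com/ollama/ollama/convert"
Both functions are unexported (lowercase names), so they can be called only from code inside the convert package.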
I/O Contract
Inputs (parseTokenizer)
| Name | Type | Required | Description |
|---|---|---|---|
| fsys | fs.FS | Yes | Model directory with tokenizer.json and config files |
| specialTokenTypes | []string | Yes | Special token types to extract (e.g., "bos", "eos", "unk", "sep", "pad") |
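To illustrate how a list such as `specialTokenTypes` might drive the lookup, here is a minimal sketch. It assumes the HuggingFace convention that `special_tokens_map.json` stores each token under a `<type>_token` key, either as a bare string or as an object with a `content` field. `lookupSpecialTokens`, `specialToken`, and `demoFS` are hypothetical names, not part of the convert package.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// specialToken accepts either "<s>" or {"content": "<s>", ...}.
type specialToken struct {
	Content string
}

func (s *specialToken) UnmarshalJSON(b []byte) error {
	// First try the bare-string form.
	var str string
	if err := json.Unmarshal(b, &str); err == nil {
		s.Content = str
		return nil
	}
	// Fall back to the object form with a "content" field.
	var obj struct {
		Content string `json:"content"`
	}
	if err := json.Unmarshal(b, &obj); err != nil {
		return err
	}
	s.Content = obj.Content
	return nil
}

// lookupSpecialTokens is a hypothetical helper that maps each requested
// type (e.g. "bos") to the token text stored under "<type>_token".
// Types that are absent from the file are simply skipped.
func lookupSpecialTokens(fsys fs.FS, types []string) (map[string]string, error) {
	data, err := fs.ReadFile(fsys, "special_tokens_map.json")
	if err != nil {
		return nil, err
	}
	// Decode lazily so unrelated keys (such as lists) cannot cause errors.
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(data, &raw); err != nil {
		return nil, err
	}
	out := make(map[string]string)
	for _, t := range types {
		msg, ok := raw[t+"_token"]
		if !ok {
			continue
		}
		var tok specialToken
		if err := json.Unmarshal(msg, &tok); err != nil {
			return nil, err
		}
		out[t] = tok.Content
	}
	return out, nil
}

// demoFS builds an in-memory model directory for the example.
func demoFS() fs.FS {
	return fstest.MapFS{
		"special_tokens_map.json": &fstest.MapFile{Data: []byte(
			`{"bos_token":"<s>","eos_token":{"content":"</s>","lstrip":false},"additional_special_tokens":["<x>"]}`,
		)},
	}
}

func main() {
	found, err := lookupSpecialTokens(demoFS(), []string{"bos", "eos", "unk"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(found)
}
```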
Inputs (parseSentencePiece)
| Name | Type | Required | Description |
|---|---|---|---|
| fsys | fs.FS | Yes | Model directory with tokenizer.model protobuf file |
Outputs
| Name | Type | Description |
|---|---|---|
| *Tokenizer | *Tokenizer | Returned by parseTokenizer: Vocabulary (tokens, scores, types), Merges, Pre-tokenizer, Template, SpecialVocabulary |
| *Vocabulary | *Vocabulary | Returned by parseSentencePiece: tokens, scores, types |
| error | error | Non-nil if the tokenizer files are missing or malformed |
Usage Examples
Internal Usage
// From convert/convert.go LoadModelMetadata
t, err := parseTokenizer(fsys, []string{"bos", "eos", "unk", "sep", "pad"})
if err != nil {
	// try SentencePiece fallback
	vocab, err := parseSentencePiece(fsys)
	...
}
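The fallback above amounts to choosing a parser according to which tokenizer file the model directory contains. The following is a minimal, self-contained sketch of that decision, under the assumption that file presence alone decides the format. `detectTokenizerFormat`, `demoFS`, and the format labels are hypothetical and do not come from the convert package.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"testing/fstest"
)

// detectTokenizerFormat is a hypothetical helper: it prefers tokenizer.json
// and falls back to the SentencePiece tokenizer.model file.
func detectTokenizerFormat(fsys fs.FS) (string, error) {
	candidates := []struct{ file, format string }{
		{"tokenizer.json", "json"},
		{"tokenizer.model", "sentencepiece"},
	}
	for _, c := range candidates {
		_, err := fs.Stat(fsys, c.file)
		if err == nil {
			return c.format, nil
		}
		// Only "file not found" means "try the next candidate".
		if !errors.Is(err, fs.ErrNotExist) {
			return "", err
		}
	}
	return "", fmt.Errorf("no tokenizer file found: %w", fs.ErrNotExist)
}

// demoFS builds an in-memory model directory holding the named empty files.
func demoFS(names ...string) fs.FS {
	m := fstest.MapFS{}
	for _, n := range names {
		m[n] = &fstest.MapFile{}
	}
	return m
}

func main() {
	dirs := []fs.FS{
		demoFS("tokenizer.json"),
		demoFS("tokenizer.model"),
		demoFS(),
	}
	for _, dir := range dirs {
		format, err := detectTokenizerFormat(dir)
		fmt.Println(format, err)
	}
}
```

Checking for the file up front, rather than treating every parse error as a cue to fall back, keeps a malformed tokenizer.json from being silently masked by the SentencePiece path.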