Implementation:Ollama Ollama Imagegen Tokenizer

Knowledge Sources	Ollama
Domains	Image Generation, Tokenization
Last Updated	2025-02-15 00:00 GMT

Overview

Implements BPE, SentencePiece, and WordPiece tokenizers for HuggingFace model formats used by the imagegen subsystem.

Description

The tokenizer.go file implements three tokenization algorithms: GPT-2 byte-level BPE (using a precomputed byte-to-rune encoding table), SentencePiece (with Unicode space handling), and WordPiece (with ## continuation tokens). The Tokenizer struct holds a Vocabulary containing the token values, a reverse lookup map, the merge rules, and the BOS/EOS/PAD tokens, along with optional byte-fallback tokens.

Tokenizers are loaded from the HuggingFace tokenizer.json format; loading parses the added_tokens, merge rules, pretokenizer patterns, and normalizer settings. Special-token configuration is loaded hierarchically from generation_config.json, config.json, tokenizer_config.json, and special_tokens_map.json, with priority-based resolution across these files. Encoding runs pretokenization in parallel, then applies BPE merges iteratively using cached pair rankings.
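
The iterative merge step of BPE mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in tokenizer.go: the function name bpeMerge, the space-joined "left right" key format of the rank map, and the sample ranks are all hypothetical. The sketch also leaves out the byte-to-rune mapping, pretokenization, and caching.

```go
package main

import (
	"fmt"
	"strings"
)

// bpeMerge repeatedly merges the adjacent pair with the lowest rank until no
// ranked pair remains. ranks maps "left right" to a merge priority (lower
// merges first). The key format is an assumption made for this sketch.
func bpeMerge(word string, ranks map[string]int) []string {
	// Start from one piece per UTF-8 sequence. A byte-level tokenizer would
	// first map raw bytes to printable runes; that step is omitted here.
	parts := strings.Split(word, "")

	for len(parts) > 1 {
		best, bestRank := -1, 0
		for i := 0; i < len(parts)-1; i++ {
			r, ok := ranks[parts[i]+" "+parts[i+1]]
			if ok && (best == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			break // no adjacent pair has a merge rule
		}

		// Replace parts[best] and parts[best+1] with their concatenation.
		merged := parts[best] + parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
		parts[best] = merged
	}
	return parts
}

func main() {
	// Hypothetical merge table: "l o" merges first, then "lo w", then "e r".
	ranks := map[string]int{"l o": 0, "lo w": 1, "e r": 2}
	fmt.Println(bpeMerge("lower", ranks)) // prints [low er]
}
```

Scanning every adjacent pair on each pass keeps the sketch short. An implementation that caches pair rankings, as the description says, would avoid recomputing lookups for unchanged pairs.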

Usage

Used by all language models in the imagegen subsystem (Qwen3, Gemma3, Llama, GLM4, GPT-OSS) for text tokenization and detokenization.

Code Reference

Source Location

Repository: Ollama
File: x/imagegen/tokenizer/tokenizer.go
Lines: 1-1173

Signature

type TokenizerType int

const (
	TokenizerBPE           TokenizerType = iota
	TokenizerSentencePiece
	TokenizerWordPiece
)

type Vocabulary struct {
	Values  []string
	Reverse map[string]int32
	Merges  map[string]int
	BOS     int32
	EOS     []int32
	PAD     int32
	AddBOS  bool
	AddEOS  bool
}

type Tokenizer struct {
	vocab         *Vocabulary
	pretokenizer  *regexp.Regexp
	specialTokens map[string]int32
	typ           TokenizerType
}

func Load(path string) (*Tokenizer, error)
func (t *Tokenizer) Encode(text string, addBOS bool) []int32
func (t *Tokenizer) Decode(tokens []int32) string
func (t *Tokenizer) Vocab() *Vocabulary
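
To illustrate how the Values and Reverse fields of Vocabulary relate, the sketch below uses a local stand-in type rather than the package's own struct. The type vocab and the helpers newVocab, lookup, and text are hypothetical names introduced for this example; the sketch assumes that a token's ID is its index in Values, which the signature suggests but does not state.

```go
package main

import "fmt"

// vocab is a local stand-in mirroring two fields of the Vocabulary struct.
// It is not the package's type.
type vocab struct {
	Values  []string         // token text, indexed by token ID (assumed)
	Reverse map[string]int32 // token text -> token ID
}

// newVocab builds the reverse index from the ordered token list.
func newVocab(values []string) *vocab {
	rev := make(map[string]int32, len(values))
	for i, v := range values {
		rev[v] = int32(i)
	}
	return &vocab{Values: values, Reverse: rev}
}

// lookup returns the ID for a token string and false if it is absent.
func (v *vocab) lookup(tok string) (int32, bool) {
	id, ok := v.Reverse[tok]
	return id, ok
}

// text returns the token string for an ID, guarding against out-of-range IDs.
func (v *vocab) text(id int32) (string, bool) {
	if id < 0 || int(id) >= len(v.Values) {
		return "", false
	}
	return v.Values[id], true
}

func main() {
	v := newVocab([]string{"<s>", "</s>", "hello", "world"})
	id, _ := v.lookup("hello")
	s, _ := v.text(id)
	fmt.Println(id, s) // prints 2 hello
}
```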

Import

import "github.com/ollama/ollama/x/imagegen/tokenizer"

I/O Contract

Inputs

Name	Type	Required	Description
path	string	Yes	Path to the tokenizer.json file (Load)
text	string	Yes	Input text to tokenize (Encode)
addBOS	bool	Yes	Whether to prepend the BOS token (Encode)
tokens	[]int32	Yes	Token IDs to convert back to text (Decode)

Outputs

Name	Type	Description
*Tokenizer	*Tokenizer	Loaded tokenizer ready for encode/decode (Load)
error	error	Non-nil if the tokenizer could not be loaded (Load)
[]int32	[]int32	Token IDs from encoding (Encode)
string	string	Decoded text from token IDs (Decode)

Usage Examples

tok, err := tokenizer.Load("/path/to/tokenizer.json")
if err != nil {
    return err
}

tokens := tok.Encode("Hello, world!", true) // true: prepend the BOS token
text := tok.Decode(tokens)
// text is expected to round-trip to "Hello, world!", assuming Decode omits the BOS token

Related Pages

Principle:Ollama_Ollama_ImageGeneration
