Implementation:Ollama Ollama Imagegen Tokenizer

From Leeroopedia
Knowledge Sources
Domains Image Generation, Tokenization
Last Updated 2025-02-15 00:00 GMT

Overview

Implements BPE, SentencePiece, and WordPiece tokenizers for HuggingFace model formats used by the imagegen subsystem.

Description

The tokenizer.go file implements three tokenization algorithms: GPT-2 byte-level BPE (using a precomputed byte-to-rune encoding table), SentencePiece (with Unicode space handling), and WordPiece (with ## continuation tokens). The Tokenizer struct holds a Vocabulary containing the token values and their reverse lookup, the merge rules, the BOS/EOS/PAD tokens, and optional byte-fallback tokens. The tokenizer is loaded from the HuggingFace tokenizer.json format, from which it parses added_tokens, merge rules, pretokenizer patterns, and normalizer settings. Special-token configuration is resolved by priority across generation_config.json, config.json, tokenizer_config.json, and special_tokens_map.json. Encoding runs pretokenization in parallel and then applies BPE merges iteratively, using cached pair rankings.
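The iterative merge step can be illustrated with a minimal sketch. It is not the code from tokenizer.go: the function name bpeMerge and the "left right" key format for the rank map are assumptions made for illustration, and the sketch looks up ranks on every pass instead of caching them.

```go
package main

import (
	"fmt"
	"math"
)

// bpeMerge repeatedly merges the adjacent pair with the lowest rank until no
// ranked pair remains. ranks maps "left right" to a merge priority (lower
// merges first); this key format is an assumption of the sketch.
func bpeMerge(symbols []string, ranks map[string]int) []string {
	parts := append([]string(nil), symbols...) // copy so the caller's slice is untouched
	for len(parts) > 1 {
		best, bestRank := -1, math.MaxInt
		for i := 0; i+1 < len(parts); i++ {
			if r, ok := ranks[parts[i]+" "+parts[i+1]]; ok && r < bestRank {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			break // no applicable merge rule left
		}
		// Replace the pair at index best with its concatenation.
		parts[best] = parts[best] + parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
	}
	return parts
}

func main() {
	ranks := map[string]int{"l o": 0, "lo w": 1, "e r": 2}
	fmt.Println(bpeMerge([]string{"l", "o", "w", "e", "r"}, ranks)) // prints [low er]
}
```

Each pass scans all adjacent pairs, so the sketch is quadratic in the number of symbols. Caching pair rankings, as the Description mentions, avoids repeating those lookups.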

Usage

Used by all language models in the imagegen subsystem (Qwen3, Gemma3, Llama, GLM4, GPT-OSS) for text tokenization and detokenization.

Code Reference

Source Location

  • Repository: Ollama
  • File: x/imagegen/tokenizer/tokenizer.go
  • Lines: 1-1173

Signature

type TokenizerType int

const (
	TokenizerBPE           TokenizerType = iota
	TokenizerSentencePiece
	TokenizerWordPiece
)

type Vocabulary struct {
	Values  []string
	Reverse map[string]int32
	Merges  map[string]int
	BOS     int32
	EOS     []int32
	PAD     int32
	AddBOS  bool
	AddEOS  bool
}

type Tokenizer struct {
	vocab         *Vocabulary
	pretokenizer  *regexp.Regexp
	specialTokens map[string]int32
	typ           TokenizerType
}

func Load(path string) (*Tokenizer, error)
func (t *Tokenizer) Encode(text string, addBOS bool) []int32
func (t *Tokenizer) Decode(tokens []int32) string
func (t *Tokenizer) Vocab() *Vocabulary
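The BOS and EOS fields of Vocabulary are filled from the model's config files by priority, as the Description notes. The sketch below shows one way such a resolution could work, under loud assumptions: the names specialIDs, rawConfig, parseEOS, and resolve are hypothetical, the priority order is taken to be the order in which the files are listed, and only the ID-valued keys bos_token_id and eos_token_id are handled (token-string keys are ignored).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// specialIDs holds the fields this sketch resolves. EOS is a slice because
// eos_token_id may be a single number or a list of numbers.
type specialIDs struct {
	BOS *int32
	EOS []int32
}

// rawConfig mirrors the two keys read from each JSON document.
type rawConfig struct {
	BOS *int32          `json:"bos_token_id"`
	EOS json.RawMessage `json:"eos_token_id"`
}

// parseEOS accepts either a number or an array of numbers.
func parseEOS(raw json.RawMessage) []int32 {
	if len(raw) == 0 || string(raw) == "null" {
		return nil
	}
	var many []int32
	if json.Unmarshal(raw, &many) == nil {
		return many
	}
	var one int32
	if json.Unmarshal(raw, &one) == nil {
		return []int32{one}
	}
	return nil
}

// resolve walks the documents from highest to lowest priority and keeps the
// first value found for each field.
func resolve(configs ...string) (specialIDs, error) {
	var out specialIDs
	for _, doc := range configs {
		var rc rawConfig
		if err := json.Unmarshal([]byte(doc), &rc); err != nil {
			return out, err
		}
		if out.BOS == nil && rc.BOS != nil {
			out.BOS = rc.BOS
		}
		if out.EOS == nil {
			out.EOS = parseEOS(rc.EOS)
		}
	}
	return out, nil
}

func main() {
	generationConfig := `{"eos_token_id": [2, 7]}`       // assumed highest priority
	config := `{"bos_token_id": 1, "eos_token_id": 2}` // assumed lower priority
	ids, err := resolve(generationConfig, config)
	if err != nil {
		panic(err)
	}
	fmt.Println(*ids.BOS, ids.EOS) // prints 1 [2 7]
}
```

The higher-priority document wins for EOS, while BOS falls through to the next document because the first one does not define it.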

Import

import "github.com/ollama/ollama/x/imagegen/tokenizer"

I/O Contract

Inputs

Name    Type     Required  Description
path    string   Yes       Path to tokenizer.json file (Load)
text    string   Yes       Input text to tokenize (Encode)
addBOS  bool     Yes       Whether to prepend BOS token (Encode)
tokens  []int32  Yes       Token IDs to decode (Decode)

Outputs

Returned by  Type        Description
Load         *Tokenizer  Loaded tokenizer ready for encode/decode
Encode       []int32     Token IDs from encoding
Decode       string      Decoded text from token IDs

Usage Examples

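// Load the HuggingFace tokenizer.json from disk.
tok, err := tokenizer.Load("/path/to/tokenizer.json")
if err != nil {
    return err
}

tokens := tok.Encode("Hello, world!", true) // true: prepend the BOS token
text := tok.Decode(tokens)
// text is expected to round-trip to "Hello, world!"
// (assuming Decode does not render the BOS token as text)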
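The byte-level BPE path mentioned in the Description relies on a byte-to-rune table, which lets any input byte be represented as a printable character and therefore supports lossless decoding. The sketch below builds the table using the mapping published with GPT-2; the function names byteToRune and encodeBytes are hypothetical and are not taken from tokenizer.go.

```go
package main

import "fmt"

// byteToRune builds a GPT-2 style table: printable bytes map to themselves,
// and every other byte is assigned the next code point starting at U+0100.
func byteToRune() [256]rune {
	var table [256]rune
	next := rune(256)
	for b := 0; b < 256; b++ {
		printable := (b >= '!' && b <= '~') ||
			(b >= 0xA1 && b <= 0xAC) ||
			(b >= 0xAE && b <= 0xFF)
		if printable {
			table[b] = rune(b)
		} else {
			table[b] = next
			next++
		}
	}
	return table
}

// encodeBytes rewrites each byte of s as its table rune, so whitespace and
// control bytes become visible characters before BPE merges are applied.
func encodeBytes(s string, table [256]rune) string {
	out := make([]rune, 0, len(s))
	for i := 0; i < len(s); i++ {
		out = append(out, table[s[i]])
	}
	return string(out)
}

func main() {
	table := byteToRune()
	fmt.Println(encodeBytes("Hello world", table)) // prints HelloĠworld
}
```

The space byte (0x20) is the 33rd non-printable byte, so it maps to U+0120 (Ġ), which is why that character appears at word boundaries in GPT-2 style vocabularies.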
