Implementation:Ollama Ollama Imagegen Tokenizer
| Knowledge Sources | |
|---|---|
| Domains | Image Generation, Tokenization |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements BPE, SentencePiece, and WordPiece tokenizers for HuggingFace model formats used by the imagegen subsystem.
Description
The tokenizer.go file provides a complete tokenizer implementation supporting three algorithms: GPT-2 byte-level BPE (with a precomputed byte-to-rune encoding table), SentencePiece (with Unicode space handling), and WordPiece (with ## continuation tokens). The Tokenizer struct holds a Vocabulary containing the token strings, a reverse lookup from token string to ID, the merge rules, the BOS/EOS/PAD tokens, and optional byte fallback tokens. The tokenizer is loaded from the HuggingFace tokenizer.json format, from which it parses added_tokens, merge rules, pretokenizer patterns, and normalizer settings. Special token configuration is loaded hierarchically from generation_config.json, config.json, tokenizer_config.json, and special_tokens_map.json, with priority-based resolution when the files disagree. Encoding runs pretokenization in parallel and then applies iterative BPE merges using cached pair rankings.
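To make the byte-level BPE step more concrete, the following is a minimal sketch of the GPT-2 style byte-to-rune table. It is not taken from tokenizer.go: the names `buildByteToRune`, `byteRunes`, and `encodeBytes` are hypothetical, and the actual file may build or store its table differently. The mapping itself follows the published GPT-2 scheme, in which printable bytes map to themselves and all other bytes are assigned code points starting at U+0100.

```go
package main

import (
	"fmt"
	"strings"
)

// buildByteToRune builds a GPT-2 style byte-to-rune table (hypothetical helper).
// Printable bytes keep their own code point; every other byte, in ascending
// order, is assigned the next free code point starting at U+0100.
func buildByteToRune() [256]rune {
	var table [256]rune
	next := rune(0x100)
	for b := 0; b < 256; b++ {
		printable := (b >= '!' && b <= '~') ||
			(b >= 0xA1 && b <= 0xAC) ||
			(b >= 0xAE && b <= 0xFF)
		if printable {
			table[b] = rune(b)
		} else {
			table[b] = next
			next++
		}
	}
	return table
}

// byteRunes is computed once at start-up, mirroring the idea of a precomputed table.
var byteRunes = buildByteToRune()

// encodeBytes rewrites each byte of s as its mapped rune so that whitespace and
// control bytes become visible characters before BPE merges are applied.
func encodeBytes(s string) string {
	var sb strings.Builder
	for i := 0; i < len(s); i++ {
		sb.WriteRune(byteRunes[s[i]])
	}
	return sb.String()
}

func main() {
	fmt.Println(encodeBytes("Hello world")) // prints HelloĠworld
}
```

Under this scheme a space (0x20) becomes U+0120 (Ġ), which is why GPT-2 style vocabularies contain entries such as "Ġworld".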
Usage
Used by all language models in the imagegen subsystem (Qwen3, Gemma3, Llama, GLM4, GPT-OSS) for text tokenization and detokenization.
Code Reference
Source Location
- Repository: Ollama
- File: x/imagegen/tokenizer/tokenizer.go
- Lines: 1-1173
Signature
type TokenizerType int

const (
	TokenizerBPE TokenizerType = iota
	TokenizerSentencePiece
	TokenizerWordPiece
)

type Vocabulary struct {
	Values  []string
	Reverse map[string]int32
	Merges  map[string]int
	BOS     int32
	EOS     []int32
	PAD     int32
	AddBOS  bool
	AddEOS  bool
}

type Tokenizer struct {
	vocab         *Vocabulary
	pretokenizer  *regexp.Regexp
	specialTokens map[string]int32
	typ           TokenizerType
}

func Load(path string) (*Tokenizer, error)
func (t *Tokenizer) Encode(text string, addBOS bool) []int32
func (t *Tokenizer) Decode(tokens []int32) string
func (t *Tokenizer) Vocab() *Vocabulary
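As an illustration of how a rank map shaped like `Merges map[string]int` can drive iterative BPE merging, here is a minimal sketch. It is not the code in tokenizer.go: the function name `mergeBPE` is hypothetical, the key format `"left right"` (space-joined, as in the merges list of tokenizer.json) is an assumption, and the pair-ranking cache and parallel pretokenization described above are omitted.

```go
package main

import "fmt"

// mergeBPE repeatedly merges the adjacent pair with the lowest rank until no
// ranked pair remains. Keys in ranks are assumed to be "left right".
func mergeBPE(symbols []string, ranks map[string]int) []string {
	// Copy the input so the caller's slice is not modified.
	parts := append([]string(nil), symbols...)
	for len(parts) > 1 {
		best, bestRank := -1, 0
		for i := 0; i+1 < len(parts); i++ {
			r, ok := ranks[parts[i]+" "+parts[i+1]]
			if ok && (best < 0 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best < 0 {
			break // no adjacent pair has a merge rule
		}
		// Join the winning pair in place and drop the right-hand element.
		parts[best] += parts[best+1]
		parts = append(parts[:best+1], parts[best+2:]...)
	}
	return parts
}

func main() {
	ranks := map[string]int{"l o": 0, "lo w": 1, "e r": 2}
	fmt.Println(mergeBPE([]string{"l", "o", "w", "e", "r"}, ranks)) // prints [low er]
}
```

Each resulting piece would then be looked up in a reverse map such as `Reverse` to obtain its token ID. A lower rank means the merge rule appears earlier in the merges list and is applied first.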
Import
import "github.com/ollama/ollama/x/imagegen/tokenizer"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | string | Yes | Path to tokenizer.json file |
| text | string | Yes | Input text to tokenize |
| addBOS | bool | Yes | Whether to prepend BOS token |
Outputs
| Returned by | Type | Description |
|---|---|---|
| Load | *Tokenizer | Loaded tokenizer ready for encode/decode |
| Load | error | Non-nil if the tokenizer could not be loaded |
| Encode | []int32 | Token IDs from encoding |
| Decode | string | Decoded text from token IDs |
Usage Examples
tok, err := tokenizer.Load("/path/to/tokenizer.json")
if err != nil {
	return err
}
tokens := tok.Encode("Hello, world!", true) // with BOS
text := tok.Decode(tokens)
// text is expected to round-trip to "Hello, world!", assuming Decode omits the BOS token