Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Ollama Ollama Tokenizer Vocabulary

From Leeroopedia
Knowledge Sources
Domains Tokenization, Text Processing
Last Updated 2025-02-15 00:00 GMT

Overview

Defines the shared vocabulary data structure used by all tokenizer implementations (BPE, SentencePiece, WordPiece), providing token encoding, decoding, special token handling, and merge lookups.

Description

Vocabulary stores parallel arrays of token values, types, scores, and merges. Provides lazy-initialized (via sync.Once) reverse lookup maps for efficient encoding (Encode: string to ID) and merge lookup (Merge: pair to rank). Supports BOS/EOS token identification with configurable AddBOS/AddEOS flags. SpecialVocabulary returns all control and user-defined tokens for special token splitting during encoding. Token types follow GGML conventions: normal, unknown, control, user-defined, unused, and byte.

Usage

The foundational data structure backing all tokenizer implementations. Loaded from model metadata during model initialization.

Code Reference

Source Location

  • Repository: Ollama
  • File: tokenizer/vocabulary.go
  • Lines: 1-112

Signature

type Special int32
const (
    SpecialBOS Special = iota
    SpecialEOS
)

type Vocabulary struct {
    Values []string
    Types  []int32
    Scores []float32
    Merges []string
    BOS, EOS       []int32
    AddBOS, AddEOS bool
}

func (v *Vocabulary) Is(id int32, special Special) bool
func (v *Vocabulary) Encode(s string) int32
func (v *Vocabulary) Decode(id int32) string
func (v *Vocabulary) SpecialVocabulary() []string
func (v *Vocabulary) Merge(left, right string) int

Import

import "github.com/ollama/ollama/tokenizer"

I/O Contract

Inputs

Name Type Required Description
s string Yes Token string to encode
id int32 Yes Token ID to decode

Outputs

Name Type Description
id int32 Token ID (-1 if not found)
token string Token string value

Usage Examples

vocab := &tokenizer.Vocabulary{
    Values: []string{"<s>", "</s>", "hello", "world"},
    Types:  []int32{3, 3, 1, 1},
    Scores: []float32{0, 0, -1.5, -2.0},
    BOS:    []int32{0},
    EOS:    []int32{1},
    AddBOS: true,
}

id := vocab.Encode("hello") // 2
token := vocab.Decode(2)    // "hello"
isBOS := vocab.Is(0, tokenizer.SpecialBOS) // true
specials := vocab.SpecialVocabulary() // ["<s>", "</s>"]

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment