Implementation:Ollama Ollama Tokenizer Vocabulary

Knowledge Sources	Ollama
Domains	Tokenization, Text Processing
Last Updated	2025-02-15 00:00 GMT

Overview

Defines the shared vocabulary data structure used by all tokenizer implementations (BPE, SentencePiece, WordPiece), providing token encoding, decoding, special token handling, and merge lookups.

Description

Vocabulary stores parallel slices of token values, types, and scores, each indexed by token ID, together with an ordered list of merges. It lazily builds (via sync.Once) the reverse lookup maps behind Encode (token string to ID) and Merge (token pair to rank), so each map is constructed once, on first use. BOS and EOS tokens are identified through the BOS and EOS ID lists, and the AddBOS/AddEOS flags configure whether those tokens are added. SpecialVocabulary returns all control and user-defined tokens, which tokenizers use to split special tokens out of the input during encoding. Token types follow GGML conventions: normal, unknown, control, user-defined, unused, and byte.
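The lazy reverse lookup described above can be sketched as follows. This is a minimal illustration, not the Ollama source: the type `lazyVocab` and its unexported fields `once` and `index` are hypothetical names chosen for the example, and only the `sync.Once` pattern and the -1 "not found" result are taken from this page.

```go
package main

import (
    "fmt"
    "sync"
)

// lazyVocab is a hypothetical, simplified stand-in for Vocabulary.
// It keeps only the token values and the lazily built reverse map.
type lazyVocab struct {
    Values []string

    once  sync.Once        // guards the one-time construction of index
    index map[string]int32 // token string -> token ID
}

// Encode returns the ID of token s, or -1 if s is not in the vocabulary.
// The reverse map is built on the first call and reused afterwards.
func (v *lazyVocab) Encode(s string) int32 {
    v.once.Do(func() {
        v.index = make(map[string]int32, len(v.Values))
        for i, tok := range v.Values {
            v.index[tok] = int32(i)
        }
    })
    if id, ok := v.index[s]; ok {
        return id
    }
    return -1
}

func main() {
    v := &lazyVocab{Values: []string{"<s>", "</s>", "hello", "world"}}
    fmt.Println(v.Encode("hello"))   // 2
    fmt.Println(v.Encode("missing")) // -1
}
```

Building the map inside `once.Do` keeps construction cheap for vocabularies that are loaded but never encoded against, and makes the first call safe when several goroutines encode concurrently.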

Usage

Vocabulary is the foundational data structure backing all tokenizer implementations. It is loaded from model metadata during model initialization.

Code Reference

Source Location

Repository: Ollama
File: tokenizer/vocabulary.go
Lines: 1-112

Signature

type Special int32
const (
    SpecialBOS Special = iota
    SpecialEOS
)

type Vocabulary struct {
    Values []string
    Types  []int32
    Scores []float32
    Merges []string
    BOS, EOS       []int32
    AddBOS, AddEOS bool
}

func (v *Vocabulary) Is(id int32, special Special) bool
func (v *Vocabulary) Encode(s string) int32
func (v *Vocabulary) Decode(id int32) string
func (v *Vocabulary) SpecialVocabulary() []string
func (v *Vocabulary) Merge(left, right string) int
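To make the `Is` and `Merge` signatures more concrete, here is a hedged sketch of how such methods might behave. The type `mergeVocab` and its private fields are hypothetical. Two details are assumptions not stated on this page: that each entry in `Merges` is a space-separated pair such as "h e" whose slice index is its rank, and that a missing pair yields -1.

```go
package main

import (
    "fmt"
    "slices"
    "sync"
)

type special int32

const (
    specialBOS special = iota
    specialEOS
)

// mergeVocab is a hypothetical stand-in holding only the fields
// needed to illustrate Is and Merge.
type mergeVocab struct {
    Merges   []string
    BOS, EOS []int32

    mergeOnce sync.Once
    ranks     map[string]int // "left right" -> rank (assumed key format)
}

// Is reports whether id is listed as a BOS or EOS token.
func (v *mergeVocab) Is(id int32, s special) bool {
    switch s {
    case specialBOS:
        return slices.Contains(v.BOS, id)
    case specialEOS:
        return slices.Contains(v.EOS, id)
    default:
        return false
    }
}

// Merge returns the rank of the pair (left, right), or -1 if the pair
// is not a known merge. Lower rank means the merge comes earlier in Merges.
func (v *mergeVocab) Merge(left, right string) int {
    v.mergeOnce.Do(func() {
        v.ranks = make(map[string]int, len(v.Merges))
        for i, m := range v.Merges {
            v.ranks[m] = i
        }
    })
    if r, ok := v.ranks[left+" "+right]; ok {
        return r
    }
    return -1
}

func main() {
    v := &mergeVocab{
        Merges: []string{"h e", "l l", "he ll"},
        BOS:    []int32{0},
        EOS:    []int32{1},
    }
    fmt.Println(v.Merge("l", "l"))     // 1
    fmt.Println(v.Merge("x", "y"))     // -1
    fmt.Println(v.Is(0, specialBOS))   // true
    fmt.Println(v.Is(0, specialEOS))   // false
}
```

BOS and EOS are slices rather than single IDs, so a model can declare more than one token of each kind and `Is` still answers with a simple membership test.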

Import

import "github.com/ollama/ollama/tokenizer"

I/O Contract

Inputs

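Name	Type	Required	Description
s	string	Yes	Token string to encode (Encode)
id	int32	Yes	Token ID to decode (Decode) or to test (Is)
special	Special	Yes	Special token kind to test for, SpecialBOS or SpecialEOS (Is)
left, right	string	Yes	Token pair to look up (Merge)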

Outputs

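Name	Type	Description
id	int32	Token ID returned by Encode (-1 if not found)
token	string	Token string value returned by Decode
match	bool	Whether id is a token of the requested special kind (Is)
rank	int	Rank of the merge pair (Merge)
specials	[]string	Control and user-defined tokens (SpecialVocabulary)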

Usage Examples

vocab := &tokenizer.Vocabulary{
    Values: []string{"<s>", "</s>", "hello", "world"},
    Types:  []int32{3, 3, 1, 1}, // GGML token types: 3 = control, 1 = normal
    Scores: []float32{0, 0, -1.5, -2.0},
    BOS:    []int32{0},
    EOS:    []int32{1},
    AddBOS: true,
}

id := vocab.Encode("hello") // 2
token := vocab.Decode(2)    // "hello"
isBOS := vocab.Is(0, tokenizer.SpecialBOS) // true
specials := vocab.SpecialVocabulary() // ["<s>", "</s>"]
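The `specials` result in the example above follows from the token types: only control and user-defined tokens are returned. A rough sketch of that filtering is shown here. The function `specialTokens` and the constant names are hypothetical, and the numeric values are an assumption based on the order in which this page lists the GGML token types (normal = 1 through byte = 6).

```go
package main

import "fmt"

// Assumed numeric token types, numbered from 1 in the order listed on this page.
const (
    tokenNormal int32 = iota + 1
    tokenUnknown
    tokenControl
    tokenUserDefined
    tokenUnused
    tokenByte
)

// specialTokens returns the values whose type is control or user-defined,
// preserving vocabulary order. values and types are parallel slices.
func specialTokens(values []string, types []int32) []string {
    var out []string
    for i, tok := range values {
        if i >= len(types) {
            break // no type information for the remaining tokens
        }
        switch types[i] {
        case tokenControl, tokenUserDefined:
            out = append(out, tok)
        }
    }
    return out
}

func main() {
    values := []string{"<s>", "</s>", "hello", "world", "<tool>"}
    types := []int32{tokenControl, tokenControl, tokenNormal, tokenNormal, tokenUserDefined}
    fmt.Println(specialTokens(values, types)) // [<s> </s> <tool>]
}
```

A tokenizer can use the returned list to cut the input at special tokens first, so that strings such as "<s>" map to a single ID instead of being split by the regular encoding path.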

Related Pages

Principle:Ollama_Ollama_Tokenization
