Implementation:Ollama Ollama Tokenizer Vocabulary
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Text Processing |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Defines the shared vocabulary data structure used by all tokenizer implementations (BPE, SentencePiece, WordPiece), providing token encoding, decoding, special token handling, and merge lookups.
Description
Vocabulary holds the token table for a model. Values, Types, and Scores are parallel slices indexed by token ID, and Merges lists the BPE merge rules. Reverse lookup maps are built lazily, guarded by sync.Once, so that Encode (string to ID) and Merge (token pair to rank) run as map lookups rather than linear scans. BOS and EOS tokens are identified through the BOS and EOS ID lists, and the AddBOS and AddEOS flags configure whether they are added. SpecialVocabulary returns all control and user-defined tokens so that encoders can split input on special tokens. Token types follow GGML conventions: normal, unknown, control, user-defined, unused, and byte.
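The lazy reverse lookups described above can be sketched as follows. This is a minimal illustration, not the actual source: the type `lazyVocab` and its unexported fields are hypothetical, and the assumption that each Merges entry has the form "left right" (joined by a single space) is not confirmed by this page.

```go
package main

import (
	"fmt"
	"sync"
)

// lazyVocab is a hypothetical, trimmed-down stand-in for Vocabulary.
type lazyVocab struct {
	Values []string
	Merges []string

	valuesOnce sync.Once
	values     map[string]int32 // token string -> token ID

	mergeOnce sync.Once
	merge     map[string]int32 // "left right" -> merge rank
}

// Encode builds the string-to-ID map on first use, then does a map lookup.
// It returns -1 when the token is not in the vocabulary.
func (v *lazyVocab) Encode(s string) int32 {
	v.valuesOnce.Do(func() {
		v.values = make(map[string]int32, len(v.Values))
		for i, value := range v.Values {
			v.values[value] = int32(i)
		}
	})
	if id, ok := v.values[s]; ok {
		return id
	}
	return -1
}

// Merge returns the rank (index in Merges) of the pair, or -1 if absent.
// Assumption: each Merges entry is "left right" with a single space.
func (v *lazyVocab) Merge(left, right string) int {
	v.mergeOnce.Do(func() {
		v.merge = make(map[string]int32, len(v.Merges))
		for i, m := range v.Merges {
			v.merge[m] = int32(i)
		}
	})
	if rank, ok := v.merge[left+" "+right]; ok {
		return int(rank)
	}
	return -1
}

func main() {
	v := &lazyVocab{
		Values: []string{"h", "e", "he", "l", "hel"},
		Merges: []string{"h e", "he l"},
	}
	fmt.Println(v.Encode("he"), v.Encode("zzz"))       // 2 -1
	fmt.Println(v.Merge("he", "l"), v.Merge("l", "l")) // 1 -1
}
```

Because sync.Once guards each map, concurrent callers can share one vocabulary without building the maps more than once, and a vocabulary that is never used for encoding pays no setup cost.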
Usage
Vocabulary is the foundational data structure behind every tokenizer implementation. It is loaded from model metadata during model initialization.
Code Reference
Source Location
- Repository: Ollama
- File: tokenizer/vocabulary.go
- Lines: 1-112
Signature
type Special int32

const (
	SpecialBOS Special = iota
	SpecialEOS
)

type Vocabulary struct {
	Values []string
	Types  []int32
	Scores []float32
	Merges []string

	BOS, EOS       []int32
	AddBOS, AddEOS bool
}

func (v *Vocabulary) Is(id int32, special Special) bool
func (v *Vocabulary) Encode(s string) int32
func (v *Vocabulary) Decode(id int32) string
func (v *Vocabulary) SpecialVocabulary() []string
func (v *Vocabulary) Merge(left, right string) int
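As an illustration of what the AddBOS and AddEOS flags are for, the sketch below shows one way a caller might apply them to an already encoded sequence. The `flags` type and the `withSpecials` helper are hypothetical and are not part of the signature listed above; using the first entry of BOS and EOS is likewise an assumption.

```go
package main

import "fmt"

// flags is a hypothetical subset of the Vocabulary fields.
type flags struct {
	BOS, EOS       []int32
	AddBOS, AddEOS bool
}

// withSpecials is a hypothetical helper, not part of the documented API.
// It prepends the first BOS ID and appends the first EOS ID when the
// corresponding flag is set and an ID is configured.
func withSpecials(f flags, ids []int32) []int32 {
	if f.AddBOS && len(f.BOS) > 0 {
		ids = append([]int32{f.BOS[0]}, ids...)
	}
	if f.AddEOS && len(f.EOS) > 0 {
		ids = append(ids, f.EOS[0])
	}
	return ids
}

func main() {
	f := flags{BOS: []int32{0}, EOS: []int32{1}, AddBOS: true}
	fmt.Println(withSpecials(f, []int32{2, 3})) // [0 2 3]
}
```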
Import
import "github.com/ollama/ollama/tokenizer"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | string | Yes | Token string to look up (argument to Encode) |
| id | int32 | Yes | Token ID to look up (argument to Decode) |
Outputs
| Name | Type | Description |
|---|---|---|
| id | int32 | Token ID returned by Encode (-1 if the string is not in the vocabulary) |
| token | string | Token string returned by Decode |
Usage Examples
// Build a four-token vocabulary; a token's ID is its index in Values.
vocab := &tokenizer.Vocabulary{
	Values: []string{"<s>", "</s>", "hello", "world"},
	Types:  []int32{3, 3, 1, 1}, // GGML type codes: 3 = control, 1 = normal
	Scores: []float32{0, 0, -1.5, -2.0},
	BOS:    []int32{0}, // ID of "<s>"
	EOS:    []int32{1}, // ID of "</s>"
	AddBOS: true,
}

id := vocab.Encode("hello")                // 2
token := vocab.Decode(2)                   // "hello"
isBOS := vocab.Is(0, tokenizer.SpecialBOS) // true
specials := vocab.SpecialVocabulary()      // ["<s>", "</s>"]
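In the example above, SpecialVocabulary returns "<s>" and "</s>" because both carry the control type. The sketch below shows how such a type-based filter and a BOS/EOS membership check might look. The type `miniVocab`, its methods, and the constant names are hypothetical; the numeric codes follow the GGUF token type convention (1 = normal through 6 = byte).

```go
package main

import "fmt"

// Token type codes following the GGUF convention; the names are hypothetical.
const (
	typeNormal int32 = iota + 1
	typeUnknown
	typeControl
	typeUserDefined
	typeUnused
	typeByte
)

// miniVocab is a hypothetical, trimmed-down stand-in for Vocabulary.
type miniVocab struct {
	Values   []string
	Types    []int32
	BOS, EOS []int32
}

// specialVocabulary collects every control or user-defined token string.
func (v *miniVocab) specialVocabulary() []string {
	var out []string
	for i, t := range v.Types {
		if t == typeControl || t == typeUserDefined {
			out = append(out, v.Values[i])
		}
	}
	return out
}

// contains reports whether id appears in ids.
func contains(ids []int32, id int32) bool {
	for _, x := range ids {
		if x == id {
			return true
		}
	}
	return false
}

func (v *miniVocab) isBOS(id int32) bool { return contains(v.BOS, id) }
func (v *miniVocab) isEOS(id int32) bool { return contains(v.EOS, id) }

func main() {
	v := &miniVocab{
		Values: []string{"<s>", "</s>", "hello", "world"},
		Types:  []int32{typeControl, typeControl, typeNormal, typeNormal},
		BOS:    []int32{0},
		EOS:    []int32{1},
	}
	fmt.Println(v.specialVocabulary()) // [<s> </s>]
	fmt.Println(v.isBOS(0), v.isEOS(0)) // true false
}
```

Keeping BOS and EOS as ID lists rather than single values lets a model declare more than one terminating token, which the membership check handles without special cases.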