Implementation:Ollama Ollama Tokenizer SentencePiece

Knowledge Sources	Ollama
Domains	Tokenization, Text Processing
Last Updated	2025-02-15 00:00 GMT

Overview

Implements SentencePiece (Unigram) tokenization, the algorithm used by Llama 2, Gemma, and other models that use score-based subword segmentation.

Description

SentencePiece wraps a Vocabulary and encodes text using per-token scores. Encode first splits the input on special tokens, then replaces each space in the remaining text with the SentencePiece whitespace separator "▁" (U+2581, LOWER ONE EIGHTH BLOCK). It then uses a priority queue (heap) to iteratively merge adjacent token pairs: for each pair, it checks whether the concatenated string exists in the vocabulary and, if so, uses that entry's vocabulary score as the merge priority. A linked-list-like merge structure lets it update the neighboring pairs efficiently after each merge. When no vocabulary entry matches, it falls back to individual character tokens.

Usage

Used for models that specify SentencePiece/Unigram tokenization (Llama 2, Mistral, Gemma 1, T5, etc.).

Code Reference

Source Location

Repository: Ollama
File: tokenizer/sentencepiece.go
Lines: 1-249

Signature

type SentencePiece struct {
    maxTokenLen int
    vocab       *Vocabulary
}

func NewSentencePiece(vocab *Vocabulary) SentencePiece
func (spm SentencePiece) Vocabulary() *Vocabulary
func (spm SentencePiece) Is(id int32, special Special) bool
func (spm SentencePiece) Encode(s string, addSpecial bool) ([]int32, error)

var _ Tokenizer = (*SentencePiece)(nil)

Import

import "github.com/ollama/ollama/tokenizer"

I/O Contract

Inputs

Name	Type	Required	Description
s	string	Yes	Text to tokenize
addSpecial	bool	Yes	Whether to add BOS/EOS tokens

Outputs

Name	Type	Description
ids	[]int32	Token IDs
err	error	Non-nil if encoding fails

Usage Examples

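// Values, Scores, and Types are parallel slices: one entry per token.
vocab := &tokenizer.Vocabulary{
    Values: []string{"<s>", "</s>", "hello", "world", /* ... */},
    Scores: []float32{0, 0, -1.5, -2.0, /* ... */},
    Types:  []int32{3, 3, 1, 1, /* ... */},
}
spm := tokenizer.NewSentencePiece(vocab)

// addSpecial = true asks Encode to add BOS/EOS tokens.
ids, err := spm.Encode("hello world", true)
if err != nil {
    log.Fatal(err) // needs the standard "log" package
}
fmt.Println(ids) // needs the standard "fmt" package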

Related Pages

Principle:Ollama_Ollama_Tokenization
