Implementation:Ollama Ollama Tokenizer SentencePiece

From Leeroopedia
Knowledge Sources
Domains: Tokenization, Text Processing
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements SentencePiece (Unigram) tokenization, the algorithm used by Llama 2, Gemma, and other models that use score-based subword segmentation.

Description

SentencePiece wraps a Vocabulary and encodes text by score-based merging. Encode first splits the input around special tokens, then replaces each space in the remaining text with the SentencePiece whitespace separator "▁" (U+2581, LOWER ONE EIGHTH BLOCK). It then uses a priority queue (min-heap) to iteratively merge adjacent token pairs: for each pair, it checks whether the concatenated string exists in the vocabulary and, if so, uses the vocabulary score as the merge priority. The pieces are kept in a linked-list-like merge structure, so only the pairs adjacent to a merge need to be updated afterwards. When no vocabulary entry matches, encoding falls back to individual character tokens.
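The merge loop described above can be sketched in isolation. This is a simplified illustration, not the code in tokenizer/sentencepiece.go: the names segment, piece, candidate, queue, and demoScores are hypothetical, the vocabulary is reduced to a map from piece text to score, and the sketch returns piece strings rather than token IDs. It also assumes that higher scores merge first (so Less is inverted on top of container/heap's min-heap ordering), and it leaves out special-token splitting and the fallback path.

```go
package main

import (
	"container/heap"
	"fmt"
	"strings"
)

// spaceMarker is the SentencePiece whitespace separator, U+2581.
const spaceMarker = "\u2581"

// piece is one live segment; prev/next form a doubly linked list (-1 = none).
type piece struct {
	text       string
	prev, next int
}

// candidate proposes merging pieces[left] and pieces[right] into merged.
type candidate struct {
	left, right int
	merged      string
	score       float32
}

// queue implements heap.Interface. Less is inverted so the highest score
// pops first (an assumption of this sketch); ties go to the leftmost pair.
type queue []candidate

func (q queue) Len() int      { return len(q) }
func (q queue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q queue) Less(i, j int) bool {
	if q[i].score != q[j].score {
		return q[i].score > q[j].score
	}
	return q[i].left < q[j].left
}
func (q *queue) Push(x any) { *q = append(*q, x.(candidate)) }
func (q *queue) Pop() any {
	old := *q
	c := old[len(old)-1]
	*q = old[:len(old)-1]
	return c
}

// segment returns the pieces left after all vocabulary-backed merges.
func segment(text string, scores map[string]float32) []string {
	runes := []rune(strings.ReplaceAll(text, " ", spaceMarker))
	pieces := make([]piece, len(runes))
	for i, r := range runes {
		pieces[i] = piece{text: string(r), prev: i - 1, next: i + 1}
	}
	if n := len(pieces); n > 0 {
		pieces[n-1].next = -1
	}

	q := &queue{}
	// propose queues the pair (l, r) only if its concatenation is in the vocabulary.
	propose := func(l, r int) {
		if l < 0 || r < 0 {
			return
		}
		merged := pieces[l].text + pieces[r].text
		if s, ok := scores[merged]; ok {
			heap.Push(q, candidate{l, r, merged, s})
		}
	}
	for i := range pieces {
		propose(i, pieces[i].next)
	}

	for q.Len() > 0 {
		c := heap.Pop(q).(candidate)
		l, r := &pieces[c.left], &pieces[c.right]
		// Skip stale candidates whose pieces were merged away or changed.
		if l.text == "" || r.text == "" || l.text+r.text != c.merged {
			continue
		}
		l.text, r.text = c.merged, ""
		l.next = r.next
		if l.next >= 0 {
			pieces[l.next].prev = c.left
		}
		// Only the two pairs touching the merged piece need re-checking.
		propose(l.prev, c.left)
		propose(c.left, l.next)
	}

	var out []string
	for i := 0; i >= 0 && i < len(pieces); i = pieces[i].next {
		out = append(out, pieces[i].text)
	}
	return out
}

// demoScores is a made-up vocabulary; real scores come from the model file.
var demoScores = map[string]float32{
	"he": -1, "\u2581w": -1.5, "ll": -2, "or": -2.5, "hell": -3,
	"\u2581wor": -3.5, "hello": -4, "ld": -5, "\u2581world": -6,
}

func main() {
	fmt.Println(segment("hello world", demoScores)) // [hello ▁world]
}
```

Stale queue entries are discarded lazily when popped instead of being removed at merge time, which keeps each merge to a constant number of heap operations. Note that a long piece such as "hello" is only reachable here because its intermediate pieces ("he", "ll", "hell") are also in the toy vocabulary.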

Usage

Used for models that specify SentencePiece/Unigram tokenization (Llama 2, Mistral, Gemma 1, T5, etc.).

Code Reference

Source Location

  • Repository: Ollama
  • File: tokenizer/sentencepiece.go
  • Lines: 1-249

Signature

type SentencePiece struct {
    maxTokenLen int
    vocab       *Vocabulary
}

func NewSentencePiece(vocab *Vocabulary) SentencePiece
func (spm SentencePiece) Vocabulary() *Vocabulary
func (spm SentencePiece) Is(id int32, special Special) bool
func (spm SentencePiece) Encode(s string, addSpecial bool) ([]int32, error)

var _ Tokenizer = (*SentencePiece)(nil)
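The last line of the signature is a compile-time interface assertion. The sketch below reproduces the pattern with a toy type so it can be run on its own; the Tokenizer interface shown here is a trimmed-down, hypothetical stand-in (the real interface presumably has more methods), and runeTokenizer is invented for illustration.

```go
package main

import "fmt"

// Tokenizer is a trimmed-down, hypothetical stand-in for the package's interface.
type Tokenizer interface {
	Encode(s string, addSpecial bool) ([]int32, error)
}

// runeTokenizer is a toy type with a value receiver, like SentencePiece above.
type runeTokenizer struct{}

// Encode maps each rune to its code point; addSpecial is ignored here.
func (runeTokenizer) Encode(s string, addSpecial bool) ([]int32, error) {
	ids := make([]int32, 0, len(s))
	for _, r := range s {
		ids = append(ids, int32(r))
	}
	return ids, nil
}

// Both assertions compile: value-receiver methods belong to the method
// sets of runeTokenizer and *runeTokenizer alike.
var (
	_ Tokenizer = runeTokenizer{}
	_ Tokenizer = (*runeTokenizer)(nil)
)

func main() {
	var tk Tokenizer = runeTokenizer{}
	ids, err := tk.Encode("hi", false)
	fmt.Println(ids, err) // [104 105] <nil>
}
```

The blank-identifier assignment produces no runtime value; its only effect is that the build fails if the type stops satisfying the interface.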

Import

import "github.com/ollama/ollama/tokenizer"

I/O Contract

Inputs

Name        Type    Required  Description
s           string  Yes       Text to tokenize
addSpecial  bool    Yes       Whether to add BOS/EOS tokens

Outputs

Name   Type     Description
ids    []int32  Token IDs
error  error    Encoding error

Usage Examples

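// Values, Scores, and Types are parallel slices indexed by token ID.
vocab := &tokenizer.Vocabulary{
    Values: []string{"<s>", "</s>", "hello", "world", ...},
    Scores: []float32{0, 0, -1.5, -2.0, ...},
    Types:  []int32{3, 3, 1, 1, ...}, // 3 = control token, 1 = normal token
}
spm := tokenizer.NewSentencePiece(vocab)

// addSpecial=true asks Encode to add BOS/EOS tokens.
ids, err := spm.Encode("hello world", true)
if err != nil {
    // handle the encoding error
}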
