Implementation:Ollama Ollama Tokenizer SentencePiece
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Text Processing |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements SentencePiece (Unigram) tokenization, the score-based subword segmentation used by Llama 2, Gemma, and other models.
Description
SentencePiece wraps a Vocabulary and encodes text using the vocabulary's scores. Encode first splits the input on special tokens, then replaces each space in the remaining text with the SentencePiece whitespace separator ▁ (U+2581, LOWER ONE EIGHTH BLOCK). It then merges adjacent token pairs iteratively using a heap-backed priority queue: for each pair, it checks whether the concatenated string exists in the vocabulary and, if so, uses that entry's score as the merge priority, so the highest-scoring merge is applied first. A linked-list-like merge structure lets it update only the neighboring pairs after each merge. When no vocabulary entry matches, it falls back to individual character tokens.
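The merge loop described above can be sketched as follows. This is a minimal illustration and not the code in tokenizer/sentencepiece.go: `mergeByScore`, `candidate`, `queue`, and `piece` are hypothetical names, the vocabulary is reduced to a plain map from piece to score, and the sketch assumes that higher scores are merged first. Special-token splitting and the fallback path are left out.

```go
package main

import (
	"container/heap"
	"fmt"
	"strings"
)

// candidate proposes merging the pieces at indexes a and b.
type candidate struct {
	a, b  int
	score float32
	size  int // byte length of the merged text; detects stale entries
}

// queue pops the highest score first; ties go to the leftmost pair.
type queue []candidate

func (q queue) Len() int { return len(q) }
func (q queue) Less(i, j int) bool {
	return q[i].score > q[j].score || (q[i].score == q[j].score && q[i].a < q[j].a)
}
func (q queue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q *queue) Push(x any)   { *q = append(*q, x.(candidate)) }
func (q *queue) Pop() any {
	old := *q
	c := old[len(old)-1]
	*q = old[:len(old)-1]
	return c
}

// piece is one node of the linked-list-like merge structure.
type piece struct {
	text       string
	prev, next int // neighbor indexes; -1 means none
}

// mergeByScore (hypothetical) segments text using a piece -> score map.
func mergeByScore(text string, scores map[string]float32) []string {
	// Replace spaces with the SentencePiece whitespace marker U+2581.
	text = strings.ReplaceAll(text, " ", "\u2581")

	// Start with one piece per rune.
	var pieces []piece
	for _, r := range text {
		i := len(pieces)
		pieces = append(pieces, piece{text: string(r), prev: i - 1, next: i + 1})
	}
	if len(pieces) == 0 {
		return nil
	}
	pieces[len(pieces)-1].next = -1

	q := &queue{}
	// tryPair queues (a, b) if the concatenated text is in the vocabulary.
	tryPair := func(a, b int) {
		if a < 0 || b < 0 {
			return
		}
		merged := pieces[a].text + pieces[b].text
		if s, ok := scores[merged]; ok {
			heap.Push(q, candidate{a: a, b: b, score: s, size: len(merged)})
		}
	}
	for i := 0; i+1 < len(pieces); i++ {
		tryPair(i, i+1)
	}

	for q.Len() > 0 {
		c := heap.Pop(q).(candidate)
		left, right := &pieces[c.a], &pieces[c.b]
		// Skip stale candidates whose pieces changed since they were queued.
		if left.text == "" || right.text == "" || len(left.text)+len(right.text) != c.size {
			continue
		}
		// Merge right into left, unlink right, and queue the new neighbors.
		left.text += right.text
		right.text = ""
		left.next = right.next
		if right.next >= 0 {
			pieces[right.next].prev = c.a
		}
		tryPair(left.prev, c.a)
		tryPair(c.a, left.next)
	}

	// Walk the list from the first piece, which is never merged away.
	var out []string
	for i := 0; i >= 0; i = pieces[i].next {
		out = append(out, pieces[i].text)
	}
	return out
}

func main() {
	scores := map[string]float32{"he": -1, "ll": -2, "hell": -3, "hello": -1.5}
	fmt.Println(strings.Join(mergeByScore("hello", scores), " "))
}
```

Storing the merged byte length in each candidate is one way to discard queue entries that became stale after a neighboring merge, which avoids having to delete entries from the heap.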
Usage
Used for models that specify SentencePiece/Unigram tokenization (Llama 2, Mistral, Gemma 1, T5, etc.).
Code Reference
Source Location
- Repository: Ollama
- File: tokenizer/sentencepiece.go
- Lines: 1-249
Signature
type SentencePiece struct {
	maxTokenLen int
	vocab       *Vocabulary
}
func NewSentencePiece(vocab *Vocabulary) SentencePiece
func (spm SentencePiece) Vocabulary() *Vocabulary
func (spm SentencePiece) Is(id int32, special Special) bool
func (spm SentencePiece) Encode(s string, addSpecial bool) ([]int32, error)
var _ Tokenizer = (*SentencePiece)(nil)
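The last line of the signature is a compile-time interface check. The pattern can be sketched in isolation as below; `Encoder` and `byteEncoder` are hypothetical stand-ins for the package's Tokenizer interface and the SentencePiece type, not part of Ollama.

```go
package main

import "fmt"

// Encoder is a hypothetical stand-in for the package's Tokenizer interface.
type Encoder interface {
	Encode(s string, addSpecial bool) ([]int32, error)
}

// byteEncoder is a toy type whose method uses a value receiver,
// as the SentencePiece methods in the signature do.
type byteEncoder struct{}

// Encode returns one ID per input byte; addSpecial is ignored here.
func (byteEncoder) Encode(s string, addSpecial bool) ([]int32, error) {
	ids := make([]int32, 0, len(s))
	for i := 0; i < len(s); i++ {
		ids = append(ids, int32(s[i]))
	}
	return ids, nil
}

// Compile-time checks: both the value type and its pointer satisfy Encoder,
// because the method set of *byteEncoder includes value-receiver methods.
var (
	_ Encoder = byteEncoder{}
	_ Encoder = (*byteEncoder)(nil)
)

func main() {
	var e Encoder = &byteEncoder{}
	ids, err := e.Encode("hi", false)
	fmt.Println(ids, err) // prints "[104 105] <nil>"
}
```

If a method were removed or its signature changed, the `var _` declarations would fail to compile, which surfaces the mismatch before any caller is built.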
Import
import "github.com/ollama/ollama/tokenizer"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | string | Yes | Text to tokenize |
| addSpecial | bool | Yes | Whether to add BOS/EOS tokens |
Outputs
| Name | Type | Description |
|---|---|---|
| ids | []int32 | Token IDs |
| err | error | Non-nil if encoding fails |
Usage Examples
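// Assumes "fmt", "log", and the tokenizer package are imported.
// Toy vocabulary for illustration; the remaining entries are elided.
vocab := &tokenizer.Vocabulary{
	Values: []string{"<s>", "</s>", "hello", "world" /* ... */},
	Scores: []float32{0, 0, -1.5, -2.0 /* ... */},
	Types:  []int32{3, 3, 1, 1 /* ... */},
}
spm := tokenizer.NewSentencePiece(vocab)
ids, err := spm.Encode("hello world", true)
if err != nil {
	log.Fatal(err)
}
fmt.Println(ids)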