Implementation:Ollama Ollama Tokenizer WordPiece
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, Text Processing |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements WordPiece tokenization, the algorithm used by BERT-style models, with CJK character handling and longest-match-first subword decomposition.
Description
WordPiece wraps a Vocabulary together with an optional lowercase-normalization flag. The words helper splits input on whitespace and punctuation and treats each character in the CJK Unicode ranges as a separate word. Encode applies a greedy longest-match-first algorithm to each word: it looks up the longest candidate subword starting at the current position and, if the vocabulary does not contain it, shortens the candidate by one character and retries. Once a candidate matches, the search resumes from the end of that match. Pieces that start a word carry the GGML-style word-boundary prefix (U+2581, "▁"), whereas the original WordPiece scheme marks continuation pieces with a "##" prefix. Decode performs the reverse mapping and applies cleanup rules for common English contractions.
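The splitting and matching steps described above can be sketched as follows. This is a minimal illustration and not the code in tokenizer/wordpiece.go: the names `splitWords`, `encodeWord`, and `isCJK` are hypothetical, the vocabulary is a plain map, only one CJK range is checked, and matching is done per rune.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// boundary is U+2581, the word-boundary marker described above.
const boundary = "\u2581"

// isCJK reports whether r is in the CJK Unified Ideographs block.
// Assumption: a single range is enough for illustration; a real
// tokenizer would cover several CJK ranges.
func isCJK(r rune) bool {
	return r >= 0x4E00 && r <= 0x9FFF
}

// splitWords breaks s on whitespace and emits each punctuation mark
// and each CJK character as its own word.
func splitWords(s string) []string {
	var words []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			words = append(words, string(cur))
			cur = cur[:0]
		}
	}
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			flush()
		case unicode.IsPunct(r) || isCJK(r):
			flush()
			words = append(words, string(r))
		default:
			cur = append(cur, r)
		}
	}
	flush()
	return words
}

// encodeWord greedily matches the longest vocabulary entry at each
// position. It returns nil if some position cannot be matched, which
// a caller could map to an unknown-token ID.
func encodeWord(vocab map[string]int32, word string) []int32 {
	runes := []rune(word)
	var ids []int32
	for start := 0; start < len(runes); {
		end := len(runes)
		matched := false
		for ; end > start; end-- {
			piece := string(runes[start:end])
			if start == 0 {
				// Only the first piece of a word carries the prefix.
				piece = boundary + piece
			}
			if id, ok := vocab[piece]; ok {
				ids = append(ids, id)
				matched = true
				break
			}
		}
		if !matched {
			return nil
		}
		start = end
	}
	return ids
}

func main() {
	vocab := map[string]int32{
		boundary + "play": 1, "ing": 2, boundary + "!": 3, boundary + "中": 4,
	}
	for _, w := range splitWords(strings.ToLower("Playing! 中")) {
		fmt.Println(w, encodeWord(vocab, w))
	}
}
```

With this toy vocabulary, "playing" falls back from the full word to the piece "▁play" and then continues with "ing", while the punctuation mark and the CJK character are each looked up as standalone words.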
Usage
Used for BERT-based models (embedding models, rerankers) that use WordPiece tokenization.
Code Reference
Source Location
- Repository: Ollama
- File: tokenizer/wordpiece.go
- Lines: 1-171
Signature
type WordPiece struct {
	vocab     *Vocabulary
	lowercase bool
}
func NewWordPiece(vocab *Vocabulary, lowercase bool) WordPiece
func (wpm WordPiece) Encode(s string, addSpecial bool) ([]int32, error)
func (wpm WordPiece) Decode(ids []int32) (string, error)
func (wpm WordPiece) Is(id int32, special Special) bool
func (wpm WordPiece) Vocabulary() *Vocabulary
var _ Tokenizer = (*WordPiece)(nil)
Import
import "github.com/ollama/ollama/tokenizer"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| s | string | Yes | Text to tokenize |
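| addSpecial | bool | Yes | Whether to add the vocabulary's BOS/EOS special tokens (for BERT vocabularies, typically [CLS] and [SEP]) |
| lowercase | bool | Yes | Argument to NewWordPiece, not Encode: whether to lowercase input before encoding |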
Outputs
| Name | Type | Description |
|---|---|---|
| ids | []int32 | Token IDs |
| error | error | Encoding error |
Usage Examples
vocab := &tokenizer.Vocabulary{...}
wpm := tokenizer.NewWordPiece(vocab, true) // lowercase=true for uncased BERT
// addSpecial=true adds the vocabulary's BOS/EOS tokens to the IDs
ids, err := wpm.Encode("Hello world", true)
if err != nil {
	panic(err)
}
// Maps IDs back to text with contraction fixes; lowercasing is not undone
text, err := wpm.Decode(ids)
if err != nil {
	panic(err)
}
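The contraction handling that Decode applies can be sketched with a `strings.NewReplacer`. This is an assumption-laden illustration: `decodePieces` and the `cleanup` rule list are hypothetical, the function works on piece strings rather than token IDs, and the actual rule set in tokenizer/wordpiece.go may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// boundary is U+2581, the word-boundary marker on word-initial pieces.
const boundary = "\u2581"

// cleanup holds a few illustrative detokenization rules.
// Assumption: this list is a sample, not the library's rule set.
var cleanup = strings.NewReplacer(
	" .", ".",
	" ,", ",",
	" !", "!",
	" ?", "?",
	" ' ", "'",
)

// decodePieces joins pieces, inserting a space before each piece that
// starts a new word, then applies the cleanup rules.
func decodePieces(pieces []string) string {
	var sb strings.Builder
	for i, p := range pieces {
		if i > 0 && strings.HasPrefix(p, boundary) {
			sb.WriteByte(' ')
		}
		sb.WriteString(strings.TrimPrefix(p, boundary))
	}
	return cleanup.Replace(sb.String())
}

func main() {
	pieces := []string{
		boundary + "it", boundary + "'", boundary + "s",
		boundary + "play", "ing", boundary + ".",
	}
	fmt.Println(decodePieces(pieces)) // prints: it's playing.
}
```

Because punctuation is split into standalone words during encoding, a naive join yields "it ' s playing ." and the replacer is what restores the conventional spacing.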