Implementation:Ollama Ollama Tokenizer WordPiece

Knowledge Sources	Ollama
Domains	Tokenization, Text Processing
Last Updated	2025-02-15 00:00 GMT

Overview

Implements WordPiece tokenization, the algorithm used by BERT-style models, with CJK character handling and longest-match-first subword decomposition.

Description

WordPiece wraps a Vocabulary together with an optional lowercase-normalization flag. The words helper splits input on whitespace and punctuation and treats each character in the CJK Unicode ranges as a separate word. Encode uses a greedy longest-match-first algorithm: for each word, it first tries the longest possible subword starting at the current position; if that subword is not in the vocabulary, it shortens the candidate by one character and retries. Once a piece matches, the search resumes from the end of that piece until the word is consumed. Instead of the original WordPiece "##" subword prefix, the tokenizer uses the GGML-style word-boundary prefix (U+2581, "▁"). Decode performs the reverse mapping and applies common English contraction rules.
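The splitting and matching steps described above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the code in tokenizer/wordpiece.go: the map-based vocabulary and the names splitWords, isCJK, and encodeWord are hypothetical, only one CJK range is checked, lowercasing is omitted, and mapping an unmatched word to a single unknown ID follows the usual BERT convention rather than anything this page confirms.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// wordPrefix is U+2581, the word-boundary marker mentioned in the description.
const wordPrefix = "\u2581"

// isCJK checks only the main CJK Unified Ideographs block (an assumption;
// a full implementation would likely cover more ranges).
func isCJK(r rune) bool {
	return r >= 0x4E00 && r <= 0x9FFF
}

// splitWords breaks text on whitespace and emits every punctuation mark
// and every CJK character as a word of its own.
func splitWords(s string) []string {
	var words []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			words = append(words, string(cur))
			cur = cur[:0]
		}
	}
	for _, r := range s {
		switch {
		case unicode.IsSpace(r):
			flush()
		case unicode.IsPunct(r) || isCJK(r):
			flush()
			words = append(words, string(r))
		default:
			cur = append(cur, r)
		}
	}
	flush()
	return words
}

// encodeWord applies greedy longest-match-first to a single word.
// It returns unkID alone when some part of the word cannot be matched.
func encodeWord(vocab map[string]int32, word string, unkID int32) []int32 {
	var pieces []int32
	start := 0
	for start < len(word) {
		end := len(word)
		matched := false
		for end > start {
			candidate := word[start:end]
			if start == 0 {
				// Only the first piece of a word carries the boundary prefix.
				candidate = wordPrefix + candidate
			}
			if id, ok := vocab[candidate]; ok {
				pieces = append(pieces, id)
				matched = true
				break
			}
			// Shorten the candidate by one character (one rune, not one byte).
			_, size := utf8.DecodeLastRuneInString(word[start:end])
			end -= size
		}
		if !matched {
			return []int32{unkID}
		}
		start = end
	}
	return pieces
}

func main() {
	// Toy vocabulary, invented for this example.
	vocab := map[string]int32{
		"[UNK]": 0, "\u2581token": 1, "izer": 2, "s": 3, "\u2581!": 4,
	}
	for _, w := range splitWords("tokenizers!") {
		fmt.Println(w, encodeWord(vocab, w, 0))
	}
	// Prints:
	// tokenizers [1 2 3]
	// ! [4]
}
```

Shortening by a whole rune rather than a byte keeps every candidate a valid UTF-8 string, which matters for non-ASCII input.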

Usage

Use this tokenizer for BERT-based models, such as embedding models and rerankers, that rely on WordPiece tokenization.

Code Reference

Source Location

Repository: Ollama
File: tokenizer/wordpiece.go
Lines: 1-171

Signature

type WordPiece struct {
    vocab     *Vocabulary
    lowercase bool
}

func NewWordPiece(vocab *Vocabulary, lowercase bool) WordPiece
func (wpm WordPiece) Encode(s string, addSpecial bool) ([]int32, error)
func (wpm WordPiece) Decode(ids []int32) (string, error)
func (wpm WordPiece) Is(id int32, special Special) bool
func (wpm WordPiece) Vocabulary() *Vocabulary

var _ Tokenizer = (*WordPiece)(nil)

Import

import "github.com/ollama/ollama/tokenizer"

I/O Contract

Inputs

Name	Type	Required	Description
s	string	Yes	Text to tokenize
addSpecial	bool	Yes	Whether to add BOS/EOS tokens
lowercase	bool	Yes	Whether to lowercase input before encoding (passed to NewWordPiece, not to Encode)

Outputs

Name	Type	Description
ids	[]int32	Token IDs
error	error	Encoding error

Usage Examples

vocab := &tokenizer.Vocabulary{...}
wpm := tokenizer.NewWordPiece(vocab, true) // lowercase=true for uncased BERT

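// Encode with BOS/EOS tokens added.
ids, err := wpm.Encode("Hello world", true)
if err != nil {
    log.Fatal(err)
}

// Decode maps the IDs back to text and applies the contraction fixes.
// Lowercasing done during encoding is not undone.
text, err := wpm.Decode(ids)
if err != nil {
    log.Fatal(err)
}
fmt.Println(text)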

Related Pages

Principle:Ollama_Ollama_Tokenization
