Implementation:Ollama Ollama Tokenizer WordPiece

From Leeroopedia
Knowledge Sources
Domains: Tokenization, Text Processing
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements WordPiece tokenization, the algorithm used by BERT-style models, with CJK character handling and longest-match-first subword decomposition.

Description

WordPiece wraps a Vocabulary together with an optional lowercase-normalization flag. The words helper splits input on whitespace and punctuation, and treats each character in the CJK Unicode ranges as a separate word. Encode applies a greedy longest-match-first algorithm to each word: it looks up the longest candidate subword starting at the current position and, if that candidate is not in the vocabulary, shortens it by one character and retries. Once a piece matches, the search resumes from the end of that piece. Word starts are marked with the GGML-style word-boundary prefix (U+2581, "▁") instead of marking continuation pieces with the original WordPiece "##" prefix. Decode performs the reverse mapping and applies cleanup rules for common English contractions.

Usage

Used for BERT-based models (embedding models, rerankers) that use WordPiece tokenization.

Code Reference

Source Location

  • Repository: Ollama
  • File: tokenizer/wordpiece.go
  • Lines: 1-171

Signature

type WordPiece struct {
    vocab     *Vocabulary
    lowercase bool
}

func NewWordPiece(vocab *Vocabulary, lowercase bool) WordPiece
func (wpm WordPiece) Encode(s string, addSpecial bool) ([]int32, error)
func (wpm WordPiece) Decode(ids []int32) (string, error)
func (wpm WordPiece) Is(id int32, special Special) bool
func (wpm WordPiece) Vocabulary() *Vocabulary

var _ Tokenizer = (*WordPiece)(nil)

Import

import "github.com/ollama/ollama/tokenizer"

I/O Contract

Inputs

Name        Type    Required  Description
s           string  Yes       Text to tokenize (Encode argument)
addSpecial  bool    Yes       Whether to add BOS/EOS tokens (Encode argument)
lowercase   bool    Yes       Whether to lowercase input before encoding (NewWordPiece argument)

Outputs

Name   Type     Description
ids    []int32  Token IDs
error  error    Encoding error

Usage Examples

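vocab := &tokenizer.Vocabulary{...}
wpm := tokenizer.NewWordPiece(vocab, true) // lowercase=true for uncased BERT

// addSpecial=true adds the BOS/EOS tokens to the encoded IDs.
ids, err := wpm.Encode("Hello world", true)
if err != nil {
    // handle the encoding error
}

// Decode maps the IDs back to text and applies the contraction fixes.
// With lowercase=true the original casing is not restored.
text, err := wpm.Decode(ids)
if err != nil {
    // handle the decoding error
}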
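
To make the decoding step concrete, the sketch below joins pieces and then applies a few cleanup rules. It is an illustration only: the decode function, the contractions rule list, and the slice-based vocabulary are hypothetical, and the rules shown are a small assumed subset in the style of common BERT cleanup, not the list used in tokenizer/wordpiece.go.

```go
package main

import (
	"fmt"
	"strings"
)

// boundary is U+2581, the word-boundary prefix on word-initial pieces.
const boundary = "\u2581"

// contractions rejoins punctuation and contraction fragments that were split
// into separate words during encoding. Assumption: illustrative subset.
var contractions = strings.NewReplacer(
	" ' ", "'",
	" n't", "n't",
	" 's", "'s",
	" .", ".",
	" ,", ",",
)

// decode (hypothetical) maps IDs to pieces, inserts a space before every
// word-initial piece except the first, then applies the cleanup rules.
func decode(values []string, ids []int32) (string, error) {
	var sb strings.Builder
	for i, id := range ids {
		if id < 0 || int(id) >= len(values) {
			return "", fmt.Errorf("invalid token id: %d", id)
		}
		piece := values[id]
		if strings.HasPrefix(piece, boundary) {
			if i > 0 {
				sb.WriteByte(' ')
			}
			piece = strings.TrimPrefix(piece, boundary)
		}
		sb.WriteString(piece)
	}
	return contractions.Replace(sb.String()), nil
}

func main() {
	// Toy vocabulary values, indexed by token ID; invented for this example.
	values := []string{
		boundary + "don", boundary + "'", boundary + "t", boundary + "stop", boundary + ".",
	}
	text, err := decode(values, []int32{0, 1, 2, 3, 4})
	fmt.Println(text, err)
}
```

Running the cleanup as a single pass over the joined string keeps the rules in one place, at the cost of also rewriting any input text that happened to contain those sequences.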
