Principle:Ollama Ollama Tokenization

From Leeroopedia
Knowledge Sources
Domains: Tokenization, NLP
Last Updated: 2025-02-15 00:00 GMT

Overview

Tokenization converts raw text into sequences of integer token IDs, and token IDs back into text. It relies on subword algorithms such as Byte Pair Encoding, SentencePiece, and WordPiece, and provides the text-to-number interface that neural language model inference requires.

Core Concepts

Byte Pair Encoding (BPE)

BPE training iteratively merges the most frequent pair of adjacent symbols in the training corpus until the desired vocabulary size is reached. At inference time, the tokenizer applies the learned merge rules in priority order (merges learned earlier take precedence) to build subword tokens from the input's initial characters or bytes. BPE handles out-of-vocabulary words by falling back to character-level or byte-level tokens. Modern variants such as GPT-style BPE operate on byte sequences, so any input text can be encoded without unknown tokens.
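The inference-time merge loop can be sketched as follows. This is a minimal toy, not the code in tokenizer/bytepairencoding.go: the function name applyMerges and the "left right" key format of the ranks map are assumptions made for illustration, and the sketch starts from characters rather than bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// applyMerges splits a word into characters and then repeatedly merges the
// adjacent pair with the lowest rank (highest priority) until no learned
// merge applies. The ranks map ("left right" -> priority) is a hypothetical
// toy format chosen for this sketch.
func applyMerges(word string, ranks map[string]int) []string {
	symbols := strings.Split(word, "") // one entry per UTF-8 character
	for len(symbols) > 1 {
		best, bestRank := -1, -1
		for i := 0; i < len(symbols)-1; i++ {
			r, ok := ranks[symbols[i]+" "+symbols[i+1]]
			if ok && (best == -1 || r < bestRank) {
				best, bestRank = i, r
			}
		}
		if best == -1 {
			break // no applicable merge rule is left
		}
		merged := symbols[best] + symbols[best+1]
		tail := append([]string{merged}, symbols[best+2:]...)
		symbols = append(symbols[:best], tail...)
	}
	return symbols
}

func main() {
	ranks := map[string]int{"l o": 0, "lo w": 1, "e r": 2}
	fmt.Println(applyMerges("lower", ranks)) // prints [low er]
}
```

A production tokenizer would typically replace the linear scan with a priority queue and operate on a byte-level alphabet, but the ordering rule is the same.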

SentencePiece

SentencePiece treats the input as a raw sequence of Unicode characters in which whitespace is an ordinary symbol, and learns a subword vocabulary using either BPE or a unigram language model. It represents spaces with a special marker (▁, U+2581), so word boundaries are preserved without language-specific pre-tokenization rules and the original text can be recovered from the pieces. This makes it particularly suitable for multilingual models. Characters missing from the vocabulary can optionally be decomposed into byte tokens. SentencePiece models are typically distributed as protobuf-encoded model files.
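The whitespace marker convention can be illustrated with a short sketch. It covers only the escaping and unescaping of spaces, not the subword segmentation itself. The helper names escapeWhitespace and joinPieces are hypothetical, and prepending one marker to the text is assumed here to mirror SentencePiece's default dummy-prefix behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// marker is U+2581 ("▁"), the symbol SentencePiece uses to stand for a space.
const marker = "\u2581"

// escapeWhitespace replaces every space with the marker and prepends one
// marker, so a word at the start of the text looks like a word after a space.
func escapeWhitespace(text string) string {
	return marker + strings.ReplaceAll(text, " ", marker)
}

// joinPieces concatenates subword pieces, turns markers back into spaces,
// and drops the single leading space introduced by the dummy prefix.
func joinPieces(pieces []string) string {
	s := strings.ReplaceAll(strings.Join(pieces, ""), marker, " ")
	return strings.TrimPrefix(s, " ")
}

func main() {
	fmt.Println(escapeWhitespace("hello world"))                // prints ▁hello▁world
	fmt.Println(joinPieces([]string{"▁hel", "lo", "▁world"})) // prints hello world
}
```

Because the space is stored inside the pieces, decoding is plain concatenation followed by one substitution, with no language-specific detokenization step.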

WordPiece

WordPiece, used primarily in BERT-family models, is similar to BPE but selects merges based on the likelihood increase of the training data rather than raw frequency. WordPiece uses a ## prefix convention to indicate continuation subwords (subwords that appear within a word rather than at the beginning). The tokenizer first identifies word boundaries and then applies subword decomposition within each word.
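The paragraph above describes how merges are chosen during training. At inference time, BERT-style tokenizers commonly split each word by greedy longest-match-first lookup against the vocabulary, which the following sketch illustrates. The function name wordPiece, the map-based vocabulary, and the "[UNK]" fallback string are assumptions for this example rather than a description of tokenizer/wordpiece.go.

```go
package main

import "fmt"

// wordPiece splits one word by repeatedly taking the longest vocabulary
// entry that matches at the current position. Pieces that do not start the
// word are looked up with the "##" continuation prefix. If some position
// has no match, the whole word maps to a single unknown token.
func wordPiece(word string, vocab map[string]bool) []string {
	runes := []rune(word)
	var pieces []string
	start := 0
	for start < len(runes) {
		end := len(runes)
		found := ""
		for end > start {
			cand := string(runes[start:end])
			if start > 0 {
				cand = "##" + cand
			}
			if vocab[cand] {
				found = cand
				break
			}
			end--
		}
		if found == "" {
			return []string{"[UNK]"}
		}
		pieces = append(pieces, found)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
	fmt.Println(wordPiece("unaffable", vocab)) // prints [un ##aff ##able]
	fmt.Println(wordPiece("xyz", vocab))       // prints [[UNK]]
}
```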

Vocabulary Management

All tokenizer variants share a common vocabulary structure that maps between token IDs and their string representations. The vocabulary includes regular tokens, special tokens such as BOS (beginning of sequence), EOS (end of sequence), PAD (padding), and UNK (unknown), and potentially added tokens that were not part of the original training vocabulary. Special tokens have designated roles in the model's input format and must be handled separately from regular tokenization.
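A bidirectional vocabulary with a special-token set might look like the sketch below. The Vocabulary type, its fields, and its method names are hypothetical and kept deliberately small; they are not the structure defined in tokenizer/vocabulary.go.

```go
package main

import "fmt"

// Vocabulary is a hypothetical minimal structure for this sketch.
type Vocabulary struct {
	tokens  []string       // token ID -> string
	ids     map[string]int // string -> token ID
	special map[int]bool   // IDs of tokens with designated roles
}

// NewVocabulary builds both lookup directions from an ordered token list
// and marks the listed strings as special if they exist in the list.
func NewVocabulary(tokens []string, special ...string) *Vocabulary {
	v := &Vocabulary{
		tokens:  tokens,
		ids:     make(map[string]int, len(tokens)),
		special: make(map[int]bool),
	}
	for id, tok := range tokens {
		v.ids[tok] = id
	}
	for _, s := range special {
		if id, ok := v.ids[s]; ok {
			v.special[id] = true
		}
	}
	return v
}

// ID returns the token ID for a string, or -1 if it is not in the vocabulary.
func (v *Vocabulary) ID(tok string) int {
	if id, ok := v.ids[tok]; ok {
		return id
	}
	return -1
}

// Token returns the string for an ID, or "" if the ID is out of range.
func (v *Vocabulary) Token(id int) string {
	if id < 0 || id >= len(v.tokens) {
		return ""
	}
	return v.tokens[id]
}

// IsSpecial reports whether the ID belongs to a special token.
func (v *Vocabulary) IsSpecial(id int) bool { return v.special[id] }

func main() {
	v := NewVocabulary([]string{"<s>", "</s>", "hello"}, "<s>", "</s>")
	fmt.Println(v.ID("hello"), v.Token(2), v.IsSpecial(0), v.IsSpecial(2))
	// prints 2 hello true false
}
```

Keeping special tokens in a separate set lets an encoder match them before regular subword splitting and lets a decoder skip them when rendering text.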

Encode and Decode

The tokenizer interface provides bidirectional conversion: Encode converts a text string to a sequence of token IDs, and Decode converts token IDs back to text. Encoding must handle special tokens, whitespace normalization, and byte-level fallback. Decoding must correctly reassemble text from subword pieces, handling continuation markers and byte tokens appropriately.
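As a rough illustration of the round trip, the sketch below pairs a two-method interface with a toy tokenizer that uses greedy longest-match on whole pieces and falls back to raw bytes. Everything here is assumed for the example: the Codec interface is not the interface in tokenizer/tokenizer.go, and the ID layout (0 to 255 for bytes, 256 and up for pieces) is invented to keep the code short. Special tokens and whitespace normalization are omitted.

```go
package main

import "fmt"

// Codec is a hypothetical two-method interface for this sketch.
type Codec interface {
	Encode(text string) []int
	Decode(ids []int) string
}

// byteFallback assigns IDs 0..255 to raw bytes and IDs 256+ to known pieces.
type byteFallback struct {
	pieces []string       // (ID - 256) -> piece
	ids    map[string]int // piece -> ID
	maxLen int            // length in bytes of the longest piece
}

func newByteFallback(pieces []string) *byteFallback {
	t := &byteFallback{pieces: pieces, ids: make(map[string]int)}
	for i, p := range pieces {
		t.ids[p] = 256 + i
		if len(p) > t.maxLen {
			t.maxLen = len(p)
		}
	}
	return t
}

// Encode takes the longest known piece at each position and emits a single
// byte token when no piece matches, so every input can be encoded.
func (t *byteFallback) Encode(text string) []int {
	var out []int
	for i := 0; i < len(text); {
		n := t.maxLen
		if rest := len(text) - i; n > rest {
			n = rest
		}
		for ; n > 0; n-- {
			if id, ok := t.ids[text[i:i+n]]; ok {
				out = append(out, id)
				break
			}
		}
		if n == 0 { // no piece matched: fall back to one raw byte
			out = append(out, int(text[i]))
			n = 1
		}
		i += n
	}
	return out
}

// Decode concatenates the bytes behind each ID; unknown IDs are skipped.
func (t *byteFallback) Decode(ids []int) string {
	var buf []byte
	for _, id := range ids {
		switch {
		case id >= 0 && id < 256:
			buf = append(buf, byte(id))
		case id >= 256 && id-256 < len(t.pieces):
			buf = append(buf, t.pieces[id-256]...)
		}
	}
	return string(buf)
}

func main() {
	var c Codec = newByteFallback([]string{"hello", " world"})
	ids := c.Encode("hello world!")
	fmt.Println(ids)           // prints [256 257 33]
	fmt.Println(c.Decode(ids)) // prints hello world!
}
```

Decoding assembles bytes before converting to a string, which matters for multi-byte characters: a character split across several byte tokens is only valid UTF-8 once all of its bytes are joined.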

Implementation Notes

The tokenizer implementations reside in tokenizer/: BPE in tokenizer/bytepairencoding.go, SentencePiece in tokenizer/sentencepiece.go, WordPiece in tokenizer/wordpiece.go, and the shared vocabulary structure in tokenizer/vocabulary.go. A common Tokenizer interface in tokenizer/tokenizer.go unifies all implementations. During model conversion, tokenizer data is extracted from source models (in convert/tokenizer.go) and embedded in the GGUF metadata.
