Principle: TensorFlow.js BPE Tokenization
Summary
Byte Pair Encoding (BPE) tokenization is a subword tokenization algorithm used extensively in modern language models. BPE iteratively merges the most frequent byte or character pairs to build a vocabulary of variable-length tokens, enabling a model to handle rare and out-of-vocabulary words by decomposing them into known subword units.
This principle is library-agnostic and applies to any framework implementing BPE-based tokenization for NLP pipelines.
Theory
BPE starts with a character-level (or byte-level) vocabulary and learns merge rules by iteratively merging the most frequent adjacent pairs observed in a training corpus. At encoding time, given a vocabulary V and merge rules M ordered by priority, the algorithm proceeds as follows:
- Split the input text into individual characters (or bytes).
- For each merge rule (a, b) → ab in M (in priority order), find all adjacent occurrences of (a, b) in the current token sequence and replace them with the merged token ab.
- Map the final tokens to vocabulary IDs using the vocabulary V.
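The three steps above can be sketched as a minimal encoder. The merge rules and vocabulary below are toy assumptions for illustration, not GPT-2's real learned tables; production implementations typically re-scan for the lowest-rank applicable pair rather than making one pass per rule, but this follows the per-rule description given here.

```typescript
// Minimal BPE encoder sketch (toy merge rules and vocabulary, not GPT-2's).
type Pair = [string, string];

function applyMerges(chars: string[], merges: Pair[]): string[] {
  let tokens = chars.slice();
  // Apply each merge rule in priority order (earlier rules first).
  for (const [a, b] of merges) {
    const next: string[] = [];
    let i = 0;
    while (i < tokens.length) {
      if (i + 1 < tokens.length && tokens[i] === a && tokens[i + 1] === b) {
        next.push(a + b); // replace the adjacent pair with the merged token
        i += 2;
      } else {
        next.push(tokens[i]);
        i += 1;
      }
    }
    tokens = next;
  }
  return tokens;
}

function encode(text: string, merges: Pair[], vocab: Map<string, number>): number[] {
  const tokens = applyMerges(Array.from(text), merges); // step 1: split
  return tokens.map((t) => vocab.get(t)!);              // step 3: lookup
}

// Toy example: "low" with merges ("l","o") then ("lo","w").
const merges: Pair[] = [["l", "o"], ["lo", "w"]];
const vocab = new Map([["low", 0], ["lo", 1], ["l", 2], ["o", 3], ["w", 4]]);
console.log(encode("low", merges, vocab)); // [0]
```

Because the rules are applied in a fixed priority order, the same input always yields the same tokens, which is the determinism property noted later in this document.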
Byte-Level BPE (GPT-2 Variant)
GPT-2 uses byte-level BPE where the base vocabulary consists of 256 byte values. This ensures that any Unicode text can be tokenized without unknown tokens, since every possible byte is in the base vocabulary. The merge rules are learned from a large training corpus by greedily selecting the most frequent adjacent byte pairs.
| Step | Operation | Description |
|---|---|---|
| 1 | Character/Byte Split | Input text is split into a sequence of bytes (256 possible values) |
| 2 | Merge Application | Merge rules are applied iteratively in priority order |
| 3 | Vocabulary Lookup | Final subword tokens are mapped to integer IDs |
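The greedy learning side (selecting the most frequent adjacent pair, merging it everywhere, and repeating) can be sketched as follows. The corpus and merge count are toy assumptions; a real training run operates over billions of bytes.

```typescript
// Sketch of learning BPE merge rules by greedy pair-frequency counting.
type Pair = [string, string];

// Count adjacent pairs across all sequences and return the most frequent one.
function mostFrequentPair(seqs: string[][]): Pair | null {
  const counts = new Map<string, number>();
  for (const seq of seqs) {
    for (let i = 0; i + 1 < seq.length; i++) {
      const key = seq[i] + "\u0000" + seq[i + 1]; // "\u0000" as a safe separator
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  let best: string | null = null;
  let bestCount = 0;
  for (const [k, c] of counts) {
    if (c > bestCount) { best = k; bestCount = c; }
  }
  return best ? (best.split("\u0000") as Pair) : null;
}

// Replace every adjacent occurrence of (a, b) with the merged symbol ab.
function mergePair(seq: string[], a: string, b: string): string[] {
  const out: string[] = [];
  let i = 0;
  while (i < seq.length) {
    if (i + 1 < seq.length && seq[i] === a && seq[i + 1] === b) {
      out.push(a + b); i += 2;
    } else { out.push(seq[i]); i += 1; }
  }
  return out;
}

// Learn k merge rules: repeatedly take the most frequent pair and merge it.
function learnMerges(corpus: string[], k: number): Pair[] {
  let seqs = corpus.map((w) => Array.from(w));
  const merges: Pair[] = [];
  for (let step = 0; step < k; step++) {
    const pair = mostFrequentPair(seqs);
    if (!pair) break;
    merges.push(pair);
    seqs = seqs.map((s) => mergePair(s, pair[0], pair[1]));
  }
  return merges;
}

console.log(learnMerges(["low", "lower", "lowest"], 2)); // [["l","o"], ["lo","w"]]
```

The order in which merges are learned becomes their application priority at encoding time, which is why the merge list is stored as an ordered sequence rather than a set.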
Formal Definition
Given:
- A base vocabulary V0 = {all 256 byte values}
- A set of learned merge rules M = [(a1, b1), (a2, b2), ..., (ak, bk)], ordered by merge priority (the order in which the rules were learned, most frequent first)
The encoding function encode(text) produces a sequence of token IDs by:
- Splitting text into bytes: [c1, c2, ..., cn]
- For each rule (ai, bi) in M, replacing all adjacent pairs matching (ai, bi) with the merged symbol
- Looking up each resulting token in the final vocabulary V (the base vocabulary V0 plus all merged symbols) to obtain integer IDs
Academic Reference
Neural Machine Translation of Rare Words with Subword Units
Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of ACL 2016. This paper introduced the application of BPE to neural machine translation, demonstrating that subword units significantly improve the handling of rare and unseen words.
Key Properties
- Deterministic encoding: Given the same merge rules, the same input always produces the same token sequence.
- Variable-length tokens: Common words become single tokens; rare words are decomposed into multiple subword tokens.
- Complete coverage: Byte-level BPE guarantees that any input text can be tokenized (no unknown tokens).
- Compact vocabulary: Typical GPT-2 vocabulary size is 50,257 tokens, balancing granularity and efficiency.
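The complete-coverage property can be checked directly: the UTF-8 encoding of any string, including non-Latin text, consists only of bytes in the range 0–255, all of which are in the base vocabulary. A minimal sketch (the sample string is an arbitrary assumption):

```typescript
// Every UTF-8 byte of any string falls in the 256-entry base vocabulary,
// so byte-level BPE never needs an <unk> token.
const baseVocab = new Set<number>();
for (let b = 0; b < 256; b++) baseVocab.add(b);

function toBytes(text: string): number[] {
  // TextEncoder always produces UTF-8 bytes (values 0..255).
  return Array.from(new TextEncoder().encode(text));
}

const sample = "héllo 世界"; // mixed Latin-1 and CJK text
console.log(toBytes(sample).every((b) => baseVocab.has(b))); // true
```

Note that a single visible character may span several bytes (e.g. "é" is two UTF-8 bytes), so byte-level token sequences can be longer than character counts suggest.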
Implementation
Implementation:Tensorflow_Tfjs_GPT2Tokenizer_Constructor
Domains
Sources
- TensorFlow.js
- Neural Machine Translation of Rare Words with Subword Units