Principle:Ggml org Ggml BPE Tokenization
NLP Tokenization GGML BPE Last updated: 2025-05-15 12:00 GMT
Summary
Byte Pair Encoding (BPE) is a subword tokenization algorithm widely used in modern language models. It enables open-vocabulary text processing by iteratively building a vocabulary of subword units from a base set of symbols.
Theory
BPE works by iteratively merging the most frequent pair of adjacent symbols in a corpus to construct a vocabulary. Starting from a base alphabet (individual characters or bytes), the algorithm repeatedly identifies the most common bigram and replaces all its occurrences with a new merged symbol. This process continues for a fixed number of merge operations, producing a vocabulary of subword units that balances between character-level and word-level representations.
The algorithm was originally proposed by Sennrich et al. (2016) for neural machine translation, adapting the classic BPE compression algorithm to the problem of open-vocabulary translation.
Byte-Level BPE (GPT-2 Variant)
GPT-2 uses a byte-level BPE variant where the base vocabulary consists of all 256 byte values (0-255), extended with merge operations learned from training data. This guarantees that any input text can be encoded without unknown tokens, since every possible byte sequence has a valid decomposition.
Encoding and Decoding
The tokenization pipeline operates in two directions:
Encoding:
- Raw text is split into words (via regex or whitespace rules)
- Each word is decomposed into its base symbols
- BPE merge rules are applied iteratively to combine adjacent symbols
- The resulting subword tokens are mapped to integer IDs via a vocabulary lookup
Decoding:
- Token IDs are mapped back to their string representations via an inverse vocabulary lookup
- Token strings are concatenated to reconstruct the original text