Principle:Ggml org Ggml BPE Tokenization

NLP Tokenization GGML BPE Last updated: 2025-05-15 12:00 GMT

Summary

Byte Pair Encoding (BPE) is a subword tokenization algorithm widely used in modern language models. It enables open-vocabulary text processing by iteratively building a vocabulary of subword units from a base set of symbols.

Theory

BPE works by iteratively merging the most frequent pair of adjacent symbols in a corpus to construct a vocabulary. Starting from a base alphabet (individual characters or bytes), the algorithm repeatedly identifies the most common bigram and replaces all its occurrences with a new merged symbol. This process continues for a fixed number of merge operations, producing a vocabulary of subword units that balances between character-level and word-level representations.

The algorithm was originally proposed by Sennrich et al. (2016) for neural machine translation, adapting the classic BPE compression algorithm to the problem of open-vocabulary translation.

Byte-Level BPE (GPT-2 Variant)

GPT-2 uses a byte-level BPE variant where the base vocabulary consists of all 256 byte values (0-255), extended with merge operations learned from training data. This guarantees that any input text can be encoded without unknown tokens, since every possible byte sequence has a valid decomposition.

Encoding and Decoding

The tokenization pipeline operates in two directions:

Encoding:

Raw text is split into words (via regex or whitespace rules)
Each word is decomposed into its base symbols
BPE merge rules are applied iteratively to combine adjacent symbols
The resulting subword tokens are mapped to integer IDs via a vocabulary lookup

Decoding:

Token IDs are mapped back to their string representations via an inverse vocabulary lookup
Token strings are concatenated to reconstruct the original text

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment