Principle:LLMBook zh LLMBook zh github io BPE Tokenization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A subword tokenization algorithm that iteratively merges the most frequent adjacent character pairs to build a vocabulary of variable-length tokens.
Description
Byte Pair Encoding (BPE) was originally a data compression algorithm adapted for NLP tokenization by Sennrich et al. (2016). It addresses the open-vocabulary problem in neural machine translation by decomposing rare words into frequent subword units.
BPE starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent symbols. After a fixed number of merge operations, the resulting vocabulary contains a mix of individual characters and common subwords, enabling efficient representation of both frequent and rare words.
BPE is the foundation for tokenizers used in GPT-2, GPT-3, LLaMA, and many other large language models.
Usage
Use this principle when designing a tokenizer for LLM pre-training. BPE is preferred when you need a language-agnostic subword tokenizer that can handle open vocabularies across multiple languages. The number of merge operations controls the vocabulary size.
Theoretical Basis
The BPE algorithm proceeds as follows:
- Initialize: Split all words into individual characters, appending an end-of-word marker (e.g., </w>).
- Count pairs: Count the frequency of all adjacent character pairs across the corpus.
- Merge: Replace all occurrences of the most frequent pair with a single merged token.
- Repeat: Go to step 2 until the desired number of merges is reached.
Pseudo-code:
# Abstract BPE algorithm (NOT real implementation)
vocab = initialize_character_vocab(corpus)
for i in range(num_merges):
pairs = count_adjacent_pairs(vocab)
if not pairs:
break
best_pair = most_frequent(pairs)
vocab = merge_pair(best_pair, vocab)
return vocab
The time complexity per merge step is O(N) where N is the corpus size in tokens. The total complexity is O(N × M) where M is the number of merges.