Principle:LLMBook zh LLMBook zh github io BPE Tokenization

Knowledge Sources	Neural Machine Translation of Rare Words with Subword Units LLMBook-zh
Domains	NLP, Tokenization
Last Updated	2026-02-08 00:00 GMT

Overview

A subword tokenization algorithm that iteratively merges the most frequent adjacent character pairs to build a vocabulary of variable-length tokens.

Description

Byte Pair Encoding (BPE) was originally a data compression algorithm adapted for NLP tokenization by Sennrich et al. (2016). It addresses the open-vocabulary problem in neural machine translation by decomposing rare words into frequent subword units.

BPE starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent symbols. After a fixed number of merge operations, the resulting vocabulary contains a mix of individual characters and common subwords, enabling efficient representation of both frequent and rare words.

BPE is the foundation for tokenizers used in GPT-2, GPT-3, LLaMA, and many other large language models.

Usage

Use this principle when designing a tokenizer for LLM pre-training. BPE is preferred when you need a language-agnostic subword tokenizer that can handle open vocabularies across multiple languages. The number of merge operations controls the vocabulary size.

Theoretical Basis

The BPE algorithm proceeds as follows:

Initialize: Split all words into individual characters, appending an end-of-word marker (e.g., </w>).
Count pairs: Count the frequency of all adjacent character pairs across the corpus.
Merge: Replace all occurrences of the most frequent pair with a single merged token.
Repeat: Go to step 2 until the desired number of merges is reached.

Pseudo-code:

# Abstract BPE algorithm (NOT real implementation)
vocab = initialize_character_vocab(corpus)
for i in range(num_merges):
    pairs = count_adjacent_pairs(vocab)
    if not pairs:
        break
    best_pair = most_frequent(pairs)
    vocab = merge_pair(best_pair, vocab)
return vocab

The time complexity per merge step is O(N) where N is the corpus size in tokens. The total complexity is O(N × M) where M is the number of merges.

Related Pages

Implemented By

Implementation:LLMBook_zh_LLMBook_zh_github_io_Encode_With_BPE

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment