Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:LLMBook zh LLMBook zh github io BPE Tokenization

From Leeroopedia


Knowledge Sources
Domains NLP, Tokenization
Last Updated 2026-02-08 00:00 GMT

Overview

A subword tokenization algorithm that iteratively merges the most frequent adjacent character pairs to build a vocabulary of variable-length tokens.

Description

Byte Pair Encoding (BPE) was originally a data compression algorithm adapted for NLP tokenization by Sennrich et al. (2016). It addresses the open-vocabulary problem in neural machine translation by decomposing rare words into frequent subword units.

BPE starts with a character-level vocabulary and iteratively merges the most frequent pair of adjacent symbols. After a fixed number of merge operations, the resulting vocabulary contains a mix of individual characters and common subwords, enabling efficient representation of both frequent and rare words.

BPE is the foundation for tokenizers used in GPT-2, GPT-3, LLaMA, and many other large language models.

Usage

Use this principle when designing a tokenizer for LLM pre-training. BPE is preferred when you need a language-agnostic subword tokenizer that can handle open vocabularies across multiple languages. The number of merge operations controls the vocabulary size.

Theoretical Basis

The BPE algorithm proceeds as follows:

  1. Initialize: Split all words into individual characters, appending an end-of-word marker (e.g., </w>).
  2. Count pairs: Count the frequency of all adjacent character pairs across the corpus.
  3. Merge: Replace all occurrences of the most frequent pair with a single merged token.
  4. Repeat: Go to step 2 until the desired number of merges is reached.

Pseudo-code:

# Abstract BPE algorithm (NOT real implementation)
vocab = initialize_character_vocab(corpus)
for i in range(num_merges):
    pairs = count_adjacent_pairs(vocab)
    if not pairs:
        break
    best_pair = most_frequent(pairs)
    vocab = merge_pair(best_pair, vocab)
return vocab

The time complexity per merge step is O(N) where N is the corpus size in tokens. The total complexity is O(N × M) where M is the number of merges.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment