Implementation:LLMBook zh LLMBook zh github io Encode With BPE

Knowledge Sources	LLMBook-zh
Domains	NLP, Tokenization
Last Updated	2026-02-08 00:00 GMT

Overview

A concrete tool, provided by the LLMBook repository, for training a BPE tokenizer from scratch.

Description

The encode_with_bpe function implements vocabulary training for the Byte Pair Encoding (BPE) algorithm. It relies on three supporting functions: extract_frequencies (initializes a character-level vocabulary, marking the end of each sequence with </w>), frequency_of_pairs (counts adjacent symbol pairs), and merge_vocab (merges every occurrence of a given pair). Starting from single characters, the implementation builds a BPE vocabulary by repeatedly merging the most frequent adjacent symbol pair, performing at most num_merges merges.
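
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the repository's code: the `_sketch` function names, the whitespace-based word splitting, and the regex-based merge are all choices made for this example.

```python
import re
from collections import Counter


def extract_frequencies_sketch(texts):
    """Count words, each stored as space-separated characters plus '</w>'."""
    vocab = Counter()
    for text in texts:
        for word in text.split():  # assumption: words are whitespace-delimited
            vocab[" ".join(word) + " </w>"] += 1
    return vocab


def frequency_of_pairs_sketch(vocab):
    """Count adjacent symbol pairs, weighted by the frequency of each word."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_vocab_sketch(pair, vocab):
    """Rewrite every word so that the given pair becomes a single symbol."""
    merged_symbol = "".join(pair)
    # Lookarounds ensure the match starts and ends on whole symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = Counter()
    for word, freq in vocab.items():
        merged[pattern.sub(lambda m: merged_symbol, word)] += freq
    return merged


def train_bpe_sketch(texts, num_merges):
    """Apply up to num_merges merges, stopping early when no pairs remain."""
    vocab = extract_frequencies_sketch(texts)
    for _ in range(num_merges):
        pairs = frequency_of_pairs_sketch(vocab)
        if not pairs:
            break  # every word is already a single symbol
        best_pair = pairs.most_common(1)[0][0]
        vocab = merge_vocab_sketch(best_pair, vocab)
    return vocab
```

In this sketch, `num_merges` acts only as an upper bound: on a small corpus the loop ends as soon as every word has collapsed into one symbol.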

Usage

Use this function to train a BPE tokenizer from scratch on a text corpus. It is intended as a pedagogical implementation of the BPE algorithm.

Code Reference

Source Location

Repository: LLMBook-zh
File: code/4.4 BPE分词.py
Lines: 61-80 (encode_with_bpe); 4-20 (extract_frequencies); 22-38 (frequency_of_pairs); 40-59 (merge_vocab)

Signature

def encode_with_bpe(texts: list[str], num_merges: int) -> Counter:
    """
    Trains a BPE tokenizer by iteratively merging the most frequent character pairs.

    Args:
        texts: List of input text strings.
        num_merges: Maximum number of BPE merge operations.

    Returns:
        Counter mapping BPE tokens to their frequencies.
    """

def extract_frequencies(texts: list[str]) -> Counter:
    """Converts texts to character sequences with </w> end marker and counts frequencies."""

def frequency_of_pairs(frequencies: Counter) -> Counter:
    """Counts frequencies of all adjacent symbol pairs in the vocabulary."""

def merge_vocab(pair: tuple, vocab: Counter) -> Counter:
    """Merges all occurrences of the given pair in the vocabulary."""
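
One detail the merge_vocab signature does not show is that a merge has to respect symbol boundaries. The sketch below contrasts a plain `str.replace` with a regex anchored on whitespace. It assumes that vocabulary entries are strings of space-separated symbols; both helper names are hypothetical, and the repository may implement the merge differently.

```python
import re


def naive_merge(pair, word):
    """Merge with plain substring replacement (can cut through symbols)."""
    return word.replace(" ".join(pair), "".join(pair))


def boundary_safe_merge(pair, word):
    """Merge only where both symbols appear as whole, adjacent symbols."""
    merged_symbol = "".join(pair)
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return pattern.sub(lambda m: merged_symbol, word)


# Symbols in this word: t, h, e, th, e, </w>
word = "t h e th e </w>"
pair = ("h", "e")

print(naive_merge(pair, word))          # t he the </w>
print(boundary_safe_merge(pair, word))  # t he th e </w>
```

The naive version fuses the existing symbol `th` with the following `e`, because the text `h e` also occurs across that boundary. The anchored pattern merges only the standalone `h` and `e`.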

Import

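# Assumes the functions from code/4.4 BPE分词.py have been saved as an importable module named bpe_tokenizer
from bpe_tokenizer import encode_with_bpe, extract_frequencies, frequency_of_pairs, merge_vocab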

I/O Contract

Inputs

Name	Type	Required	Description
texts	list[str]	Yes	List of input text strings to build vocabulary from
num_merges	int	Yes	Maximum number of BPE merge operations to perform

Outputs

Name	Type	Description
return	Counter	Merged vocabulary mapping BPE tokens to their frequencies
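
If the returned Counter is keyed by entries written as space-separated symbols (an assumption about the output format, which the table above does not spell out), per-symbol counts can be derived by splitting each key. The function name and the sample vocabulary below are hypothetical and exist only for this illustration.

```python
from collections import Counter


def token_counts(vocab):
    """Sum symbol occurrences over a vocabulary keyed by space-separated symbols."""
    counts = Counter()
    for entry, freq in vocab.items():
        for symbol in entry.split():
            counts[symbol] += freq
    return counts


# Hypothetical merged vocabulary, for illustration only.
sample_vocab = Counter({"low </w>": 2, "low e r </w>": 1})

for symbol, freq in token_counts(sample_vocab).most_common():
    print(f"{symbol}: {freq}")
```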

Usage Examples

from collections import Counter
from bpe_tokenizer import encode_with_bpe

# Sample corpus
texts = ["low lower newest", "widest wider new"]

# Train BPE with at most 1000 merges (num_merges is an upper bound)
num_merges = 1000
bpe_vocab = encode_with_bpe(texts, num_merges)

# Inspect the learned vocabulary
for token, freq in bpe_vocab.most_common(10):
    print(f"{token}: {freq}")

Related Pages

Implements Principle

Principle:LLMBook_zh_LLMBook_zh_github_io_BPE_Tokenization

Requires Environment

Environment:LLMBook_zh_LLMBook_zh_github_io_Data_Processing_Environment
