Implementation: LLMBook-zh/LLMBook-zh.github.io: Encode With BPE
| Knowledge Sources | |
|---|---|
| Domains | NLP, Tokenization |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool for training a BPE tokenizer from scratch, provided by the LLMBook-zh repository.
Description
The encode_with_bpe function implements the training loop of the Byte Pair Encoding (BPE) algorithm. It relies on three supporting functions: extract_frequencies (initializes a character-level vocabulary), frequency_of_pairs (counts adjacent symbol pairs), and merge_vocab (merges a given pair throughout the vocabulary). Starting from single characters, the implementation builds a BPE vocabulary by repeatedly merging the most frequent adjacent symbol pair, up to a caller-specified number of merges.
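The loop described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names and signatures follow this page, but the bodies are assumptions. In particular, splitting each text on whitespace, the regex-based merge, and stopping early when no pairs remain are choices made here for the sketch.

```python
import re
from collections import Counter


def extract_frequencies(texts):
    """Split each whitespace-separated word into characters plus a </w> marker."""
    vocab = Counter()
    for text in texts:
        for word in text.split():  # assumption: words are whitespace-delimited
            vocab[" ".join(word) + " </w>"] += 1
    return vocab


def frequency_of_pairs(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Rewrite every word so that the given pair becomes a single symbol."""
    # Match the pair only where both halves are whole symbols.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged_symbol = "".join(pair)
    merged = Counter()
    for word, freq in vocab.items():
        merged[pattern.sub(lambda _match: merged_symbol, word)] += freq
    return merged


def encode_with_bpe(texts, num_merges):
    """Apply up to num_merges merges, stopping early once no pairs remain."""
    vocab = extract_frequencies(texts)
    for _ in range(num_merges):
        pairs = frequency_of_pairs(vocab)
        if not pairs:
            break
        best_pair = pairs.most_common(1)[0][0]
        vocab = merge_vocab(best_pair, vocab)
    return vocab


two_merges = encode_with_bpe(["low low lower"], num_merges=2)
print(two_merges)  # "low" is now a single symbol in both words

fully_merged = encode_with_bpe(["low low lower"], num_merges=1000)
print(fully_merged)  # every word has collapsed into one symbol
```

In this sketch each vocabulary key is a whole word written as space-separated symbols, so the effect of a merge is visible directly in the keys. With a corpus this small, the loop runs out of pairs long before 1000 merges, which is why num_merges acts as an upper bound rather than an exact count.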
Usage
Use this function when you need to train a BPE tokenizer from scratch on a text corpus. It is a pedagogical implementation, best suited to studying how the BPE algorithm works.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/4.4 BPE分词.py
- Lines: 61-80 (encode_with_bpe); 4-20 (extract_frequencies); 22-38 (frequency_of_pairs); 40-59 (merge_vocab)
Signature
def encode_with_bpe(texts: list[str], num_merges: int) -> Counter:
    """
    Trains a BPE tokenizer by iteratively merging the most frequent character pairs.

    Args:
        texts: List of input text strings.
        num_merges: Maximum number of BPE merge operations.

    Returns:
        Counter mapping BPE tokens to their frequencies.
    """

def extract_frequencies(texts: list[str]) -> Counter:
    """Converts texts to character sequences with </w> end marker and counts frequencies."""

def frequency_of_pairs(frequencies: Counter) -> Counter:
    """Counts frequencies of all adjacent symbol pairs in the vocabulary."""

def merge_vocab(pair: tuple, vocab: Counter) -> Counter:
    """Merges all occurrences of the given pair in the vocabulary."""
Import
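# Assumes the source file has been copied or renamed to bpe_tokenizer.py;
# "4.4 BPE分词.py" is not a valid module name for a plain import statement.
from bpe_tokenizer import encode_with_bpe, extract_frequencies, frequency_of_pairs, merge_vocab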
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| texts | list[str] | Yes | List of input text strings to build vocabulary from |
| num_merges | int | Yes | Maximum number of BPE merge operations to perform |
Outputs
| Name | Type | Description |
|---|---|---|
| return | Counter | Merged vocabulary mapping BPE tokens to their frequencies |
Usage Examples
from collections import Counter
from bpe_tokenizer import encode_with_bpe
# Sample corpus
texts = ["low lower newest", "widest wider new"]
# Train BPE with up to 1000 merges (num_merges is an upper bound)
num_merges = 1000
bpe_vocab = encode_with_bpe(texts, num_merges)
# Inspect the learned vocabulary
for token, freq in bpe_vocab.most_common(10):
    print(f"{token}: {freq}")