

Implementation:LLMBook zh LLMBook zh github io Encode With BPE

From Leeroopedia
Revision as of 15:23, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/LLMBook_zh_LLMBook_zh_github_io_Encode_With_BPE.md)


Knowledge Sources
Domains: NLP, Tokenization
Last Updated: 2026-02-08 00:00 GMT

Overview

A concrete implementation of BPE tokenizer training from scratch, provided by the LLMBook repository.

Description

The encode_with_bpe function implements Byte Pair Encoding (BPE) vocabulary training. It relies on three supporting functions: extract_frequencies (builds the initial character-level vocabulary, marking the end of each word with </w>), frequency_of_pairs (counts adjacent symbol pairs), and merge_vocab (merges every occurrence of a given pair). Starting from single characters, the implementation builds a BPE vocabulary by repeatedly merging the most frequent adjacent symbol pair, up to num_merges times.
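The loop described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names follow the signatures on this page, but the bodies, the whitespace-based word splitting, and the representation of each word as a space-separated string of symbols are assumptions made for the example.

```python
import re
from collections import Counter


def extract_frequencies(texts):
    """Split each whitespace-delimited word into characters plus a </w> marker."""
    vocab = Counter()
    for text in texts:
        for word in text.split():
            vocab[" ".join(word) + " </w>"] += 1
    return vocab


def frequency_of_pairs(vocab):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_vocab(pair, vocab):
    """Fuse every standalone occurrence of `pair` into one symbol."""
    # The lookarounds keep the match on symbol boundaries, so the pair
    # ("b", "c") does not match inside the already-merged symbol "ab".
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    fused = "".join(pair)
    merged = Counter()
    for word, freq in vocab.items():
        merged[pattern.sub(lambda _match: fused, word)] += freq
    return merged


def encode_with_bpe(texts, num_merges):
    """Run at most `num_merges` merges, stopping early when no pairs remain."""
    vocab = extract_frequencies(texts)
    for _ in range(num_merges):
        pairs = frequency_of_pairs(vocab)
        if not pairs:
            break
        best_pair = pairs.most_common(1)[0][0]
        vocab = merge_vocab(best_pair, vocab)
    return vocab


print(encode_with_bpe(["low low lower"], 2))
```

In this sketch num_merges acts as an upper bound: once every word has collapsed into a single symbol there are no pairs left to count, and the loop exits early.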

Usage

Use this function when you need to train a BPE tokenizer from scratch on a text corpus; it is intended as a pedagogical implementation of the BPE algorithm.

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/4.4 BPE分词.py
  • Lines: 61-80 (encode_with_bpe); 4-20 (extract_frequencies); 22-38 (frequency_of_pairs); 40-59 (merge_vocab)

Signature

def encode_with_bpe(texts: list[str], num_merges: int) -> Counter:
    """
    Trains a BPE tokenizer by iteratively merging the most frequent adjacent symbol pair.

    Args:
        texts: List of input text strings.
        num_merges: Maximum number of BPE merge operations.

    Returns:
        Counter mapping BPE tokens to their frequencies.
    """

def extract_frequencies(texts: list[str]) -> Counter:
    """Converts texts to character sequences with </w> end marker and counts frequencies."""

def frequency_of_pairs(frequencies: Counter) -> Counter:
    """Counts frequencies of all adjacent symbol pairs in the vocabulary."""

def merge_vocab(pair: tuple, vocab: Counter) -> Counter:
    """Merges all occurrences of the given pair in the vocabulary."""

Import

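# Assumes the source file (code/4.4 BPE分词.py) has been saved as an importable module named bpe_tokenizer.py.
from bpe_tokenizer import encode_with_bpe, extract_frequencies, frequency_of_pairs, merge_vocab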

I/O Contract

Inputs

Name | Type | Required | Description
texts | list[str] | Yes | List of input text strings to build vocabulary from
num_merges | int | Yes | Maximum number of BPE merge operations to perform

Outputs

Name | Type | Description
return | Counter | Merged vocabulary mapping BPE tokens to their frequencies
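The output is described as a mapping from tokens to frequencies. If a given version of the code instead returns a Counter keyed by whole words written as space-separated symbols (an assumption, not something this page confirms), per-token counts can be derived with a small helper. The function name token_frequencies and the sample vocabulary below are hypothetical and exist only for this illustration.

```python
from collections import Counter


def token_frequencies(vocab):
    """Sum each word's frequency over every space-separated symbol in its key."""
    tokens = Counter()
    for word, freq in vocab.items():
        for symbol in word.split():
            tokens[symbol] += freq
    return tokens


# Hypothetical word-level vocabulary, shaped like the result of a few merges.
word_vocab = Counter({"low </w>": 2, "low e r </w>": 1})
print(token_frequencies(word_vocab))
```

Checking which of the two shapes the function actually returns is worthwhile before relying on most_common to rank tokens.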

Usage Examples

from collections import Counter
from bpe_tokenizer import encode_with_bpe

# Sample corpus
texts = ["low lower newest", "widest wider new"]

# Train BPE with up to 1000 merges (num_merges is an upper bound, not a guaranteed count)
num_merges = 1000
bpe_vocab = encode_with_bpe(texts, num_merges)

# Inspect the learned vocabulary
for token, freq in bpe_vocab.most_common(10):
    print(f"{token}: {freq}")

Related Pages

Implements Principle

Requires Environment
