Principle:Microsoft LoRA BPE Text Decoding
Overview
BPE Text Decoding is the principle of converting BPE (Byte Pair Encoding) token IDs produced by beam search back into human-readable text, and formatting the output into prediction and reference files suitable for automatic evaluation. This step bridges the gap between the model's numerical output space and the natural language metrics used to assess generation quality.
Description
BPE Decoding
GPT-2's BPE tokenizer maps text to integer token IDs through a reversible encoding process:
- Text to bytes: Each character is mapped to its UTF-8 byte representation.
- Bytes to unicode: Each byte is mapped to a printable unicode character via a lookup table (to avoid whitespace/control characters that would interfere with BPE).
- BPE merges: Character sequences are iteratively merged according to a learned merge table (vocab.bpe), producing subword tokens.
- Token to ID: Each subword token is mapped to an integer via the encoder dictionary (encoder.json).
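The merge and lookup steps can be illustrated with a toy example. This is a minimal sketch, not the repository's code: the merge table and vocabulary below are invented for illustration (the real ones are loaded from vocab.bpe and encoder.json), the helper names are hypothetical, and the input is assumed to be already mapped through the byte-to-unicode table.

```python
# Toy merge table: earlier entries have higher priority (lower rank).
# Both tables are made up; real ones come from vocab.bpe and encoder.json.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
TOY_ENCODER = {"low": 0, "er": 1, "l": 2, "o": 3, "w": 4, "e": 5, "r": 6}


def toy_bpe(word, merges):
    """Repeatedly apply the best-ranked merge until none applies."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = set(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies to any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def toy_encode(word):
    """Map each subword produced by toy_bpe to its integer ID."""
    return [TOY_ENCODER[tok] for tok in toy_bpe(word, TOY_MERGES)]
```

With these toy tables, "lower" is merged into the subwords "low" and "er" before the ID lookup.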
Decoding reverses this process:
- ID to token: Each integer is mapped back to its subword string via the decoder dictionary.
- Unicode to bytes: Each unicode character is mapped back to its original byte value.
- Bytes to text: The byte sequence is decoded as UTF-8 to produce the final text string.
The Encoder.decode(tokens) method performs all three steps:
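def decode(self, tokens):
    # ID to token: look up each ID and concatenate the subword strings.
    text = ''.join([self.decoder[token] for token in tokens])
    # Unicode to bytes, then bytes to text; invalid UTF-8 becomes U+FFFD.
    text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors='replace')
    return text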
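To make the three steps concrete, the following sketch runs the same logic on hand-built tables. ToyDecoder and its vocabulary are hypothetical stand-ins. The byte table covers only printable ASCII plus the space byte, which GPT-2's table represents as the character U+0120; the real tables are built from encoder.json and bytes_to_unicode().

```python
class ToyDecoder:
    """Hypothetical stand-in for the real Encoder, with tiny tables."""

    def __init__(self):
        # ID -> subword string; "\u0120" is the printable stand-in for a space.
        self.decoder = {0: "Hello", 1: "\u0120world", 2: "!"}
        # Printable ASCII (bytes 33..126) maps to itself in GPT-2's table.
        self.byte_decoder = {chr(b): b for b in range(33, 127)}
        self.byte_decoder["\u0120"] = 32  # space byte

    def decode(self, tokens):
        text = "".join(self.decoder[t] for t in tokens)
        raw = bytearray(self.byte_decoder[c] for c in text)
        return raw.decode("utf-8", errors="replace")


toy = ToyDecoder()
decoded = toy.decode([0, 1, 2])
```

Here the three IDs decode to the string "Hello world!", with the U+0120 marker turned back into an ordinary space.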
Post-Processing
After BPE decoding, the raw text undergoes additional cleaning:
- EOS truncation: The decoded text is split on <|endoftext|> and only the portion before the first EOS is retained.
- Paragraph truncation: The text is split on double newlines (\n\n) and only the first paragraph is kept.
- Whitespace stripping: Leading and trailing whitespace is removed.
- Optional tokenization: When the --tokenize flag is set, the text is tokenized by splitting on word boundaries using a regex pattern and rejoining with spaces.
- Optional lowercasing: When the --lower flag is set, the text is converted to lowercase.
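The cleaning steps above can be sketched as a single function. This is an illustration under assumptions rather than the script itself: the function names are hypothetical, and the regex (splitting on every non-word character) is an assumed stand-in for the pattern the script uses.

```python
import re

EOS = "<|endoftext|>"


def simple_tokenize(sent):
    """Put spaces around non-word characters, then collapse whitespace.

    The pattern is an assumption, not taken from the source description.
    """
    parts = re.split(r"(\W)", sent)
    return " ".join(" ".join(parts).split())


def postprocess(decoded, tokenize=False, lower=False):
    """Apply EOS truncation, paragraph truncation, stripping, and options."""
    text = decoded.split(EOS)[0]   # keep text before the first EOS
    text = text.split("\n\n")[0]   # keep only the first paragraph
    text = text.strip()            # drop leading/trailing whitespace
    if tokenize:
        text = simple_tokenize(text)
    if lower:
        text = text.lower()
    return text
```

For example, `postprocess(" The Eagle is cheap.<|endoftext|>extra", tokenize=True, lower=True)` yields `"the eagle is cheap ."`, with the final period split off as its own token.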
Reference File Formatting
The decoding script produces two types of output files depending on the dataset format:
E2E Format
References are grouped by context, with multiple references per group separated by newlines and groups separated by blank lines. The prediction file contains one hypothesis per line, aligned with the reference groups.
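A minimal sketch of this layout, assuming the samples arrive as (context, reference, prediction) triples. The function name and the input shape are hypothetical, and keeping the first prediction seen for each context is an assumption made for the example.

```python
def format_e2e(samples):
    """Build reference and prediction file contents in the E2E layout.

    samples: iterable of (context, reference, prediction) triples
    (a hypothetical input shape chosen for this sketch).
    """
    refs = {}   # context -> list of references, in first-seen order
    preds = {}  # context -> first prediction seen for that context
    for context, reference, prediction in samples:
        refs.setdefault(context, []).append(reference)
        preds.setdefault(context, prediction)
    # One reference per line, with a blank line closing each group.
    ref_text = "".join("\n".join(group) + "\n\n" for group in refs.values())
    # One hypothesis per line, in the same order as the groups.
    pred_text = "".join(p + "\n" for p in preds.values())
    return ref_text, pred_text
```

Because both dictionaries preserve insertion order, line i of the prediction text lines up with the i-th reference group.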
WebNLG / DART Format
References are stored in separate numbered files (reference0, reference1, ..., reference{N-1}) within a directory. Each file contains one reference per line aligned with the predictions. When fewer references exist than the specified count, the first reference is duplicated to fill the gap.
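The padding rule and file layout can be sketched as follows. The function names and arguments are hypothetical, each example is assumed to have at least one reference, and truncating surplus references to the specified count is an assumption not stated above.

```python
import os


def pad_references(refs, ref_num):
    """Return exactly ref_num references, repeating the first to fill gaps."""
    refs = list(refs[:ref_num])  # assumption: surplus references are dropped
    return refs + [refs[0]] * (ref_num - len(refs))


def write_reference_files(out_dir, grouped_refs, ref_num):
    """Write reference0 .. reference{ref_num-1}, one line per example."""
    os.makedirs(out_dir, exist_ok=True)
    padded = [pad_references(refs, ref_num) for refs in grouped_refs]
    for i in range(ref_num):
        path = os.path.join(out_dir, f"reference{i}")
        with open(path, "w", encoding="utf-8") as f:
            for refs in padded:
                f.write(refs[i] + "\n")
```

Every file ends up with the same number of lines, so line k of each reference file corresponds to line k of the prediction file.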
Theoretical Basis
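The reversibility of BPE encoding rests on two properties. First, BPE merges only concatenate adjacent symbols, so joining the decoded tokens recovers the original symbol sequence. Second, the bytes_to_unicode() function constructs a bijective mapping from all 256 possible byte values to unique printable unicode characters, so no information is lost when converting between bytes and symbols. Text that was encoded from valid UTF-8 therefore survives the encoding-decoding round trip unchanged. The only source of potential information loss is the errors='replace' parameter in UTF-8 decoding, which replaces invalid byte sequences with the Unicode replacement character. This can occur when a generated token sequence does not correspond to valid UTF-8, for example when it ends partway through a multi-byte character.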
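The bijection and the one lossy case can be checked directly. The sketch below is modeled on the widely circulated GPT-2 reference implementation of bytes_to_unicode() and should be read as an approximation of it; the helpers to_symbols and from_symbols are hypothetical names introduced for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character."""
    # Bytes that are already printable keep their own code point:
    # '!'..'~' (0x21-0x7E), U+00A1..U+00AC, and U+00AE..U+00FF.
    bs = (list(range(0x21, 0x7F))
          + list(range(0xA1, 0xAD))
          + list(range(0xAE, 0x100)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Remaining bytes (whitespace, control) shift to 256 and above.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {u: b for b, u in byte_encoder.items()}


def to_symbols(text):
    """UTF-8 encode, then map every byte to its printable character."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def from_symbols(symbols):
    """Invert to_symbols; invalid UTF-8 is replaced, as in decode()."""
    raw = bytes(byte_decoder[c] for c in symbols)
    return raw.decode("utf-8", errors="replace")
```

Round-tripping any valid string through to_symbols and from_symbols returns it unchanged. Cutting a multi-byte character in half, as in `from_symbols(to_symbols("é")[:1])`, produces the replacement character U+FFFD instead.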
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Decoding, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |