Implementation:Microsoft LoRA GPT2 Decode Script

Overview

GPT2_Decode_Script converts beam search output (JSONL of token IDs) back into human-readable prediction and reference text files for evaluation. It uses the GPT-2 BPE decoder to map token IDs to text, applies post-processing (truncation, optional tokenization and lowercasing), and formats references according to dataset-specific conventions (E2E vs. WebNLG/DART).

Type

API Doc

Source

examples/NLG/src/gpt2_decode.py (lines 65-163)
examples/NLG/src/encoder.py (lines 46-132)

CLI Signature

python src/gpt2_decode.py \
    --vocab <vocab_dir> --sample_file <beam_output.jsonl> \
    --input_file <original_data.jsonl> \
    --output_ref_file <ref_path> --output_pred_file <pred_path> \
    --ref_type <e2e|webnlg|dart> --ref_num <N> \
    [--tokenize] [--lower] [--filter <all|seen|unseen>] \
    [--ref_unique_file <unique_id_file>]

Argument reference:

Argument	Type	Default	Description
`--vocab`	str	None	Directory containing `encoder.json` and `vocab.bpe`
`--sample_file`	str	None	Beam search output JSONL (`{"id": int, "predict": [...]}`)
`--input_file`	str	None	Original BPE-encoded data JSONL (for references)
`--output_ref_file`	str	None	Output path for reference file(s)
`--output_pred_file`	str	None	Output path for prediction file
`--ref_type`	str	e2e	Reference format: e2e, webnlg, or dart
`--ref_num`	int	4	Number of reference files (for webnlg/dart)
`--tokenize`	flag	False	Apply word tokenization to outputs
`--lower`	flag	False	Lowercase all outputs
`--filter`	str	all	WebNLG category filter: all, seen, unseen
`--ref_unique_file`	str	None	Optional unique ID file for reference grouping
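
The table above can be read as an argparse configuration. The following is a minimal sketch reconstructed from the table only, not copied from `gpt2_decode.py`; the function name `build_parser` and the `choices` restrictions are assumptions added for illustration.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the argument table (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Decode beam search output (sketch)")
    parser.add_argument("--vocab", type=str, default=None)
    parser.add_argument("--sample_file", type=str, default=None)
    parser.add_argument("--input_file", type=str, default=None)
    parser.add_argument("--output_ref_file", type=str, default=None)
    parser.add_argument("--output_pred_file", type=str, default=None)
    # Assumption: `choices` is used here to document the accepted values;
    # the real script may validate them differently or not at all.
    parser.add_argument("--ref_type", type=str, default="e2e",
                        choices=["e2e", "webnlg", "dart"])
    parser.add_argument("--ref_num", type=int, default=4)
    # Flags default to False and become True when present on the command line.
    parser.add_argument("--tokenize", action="store_true")
    parser.add_argument("--lower", action="store_true")
    parser.add_argument("--filter", type=str, default="all",
                        choices=["all", "seen", "unseen"])
    parser.add_argument("--ref_unique_file", type=str, default=None)
    return parser


args = build_parser().parse_args(["--ref_type", "webnlg", "--lower"])
```

With this sketch, omitted options fall back to the defaults listed in the table, so `args.ref_num` is 4 and `args.tokenize` is False in the call above.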

Key Internal Components

Encoder.decode (encoder.py:117-120)

class Encoder:
    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors='replace')
        return text

Maps a list of integer token IDs back to a UTF-8 string through the reverse BPE lookup table and byte-to-unicode mapping.
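
To make the two lookup steps concrete, here is a small self-contained sketch with a toy three-token vocabulary. The `bytes_to_unicode` helper follows the well-known GPT-2 byte-to-unicode table; the vocabulary and the name `toy_decode` are hypothetical and stand in for the real `encoder.json` contents.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable unicode character, GPT-2 style."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(2 ** 8):
        if b not in bs:
            # Bytes without a printable form are shifted above U+00FF.
            bs.append(b)
            cs.append(2 ** 8 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


# Reverse table: unicode character -> original byte value.
byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}

# Hypothetical toy vocabulary; "\u0120" is how a leading space byte is stored.
toy_vocab = {0: "Hello", 1: "\u0120world", 2: "!"}


def toy_decode(tokens):
    """Token IDs -> BPE strings -> raw bytes -> UTF-8 text."""
    text = "".join(toy_vocab[t] for t in tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors="replace")


decoded = toy_decode([0, 1, 2])  # "Hello world!"
```

The `errors='replace'` argument means that a token sequence ending in the middle of a multi-byte character yields a replacement character rather than raising an exception.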

encoder.get_encoder (encoder.py:123-132)

def get_encoder(models_dir):
    with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(encoder=encoder, bpe_merges=bpe_merges)
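
The slice `[1:-1]` in `get_encoder` drops the first line of `vocab.bpe` and the empty string left after the final newline. The sketch below shows this on an in-memory string; the sample content and the function name `parse_merges` are illustrative assumptions, not the real file.

```python
def parse_merges(bpe_data):
    """Turn the raw text of a merges file into a list of (left, right) pairs."""
    lines = bpe_data.split("\n")
    # lines[0] is a header line and lines[-1] is the empty string that
    # follows the trailing newline, so both are skipped.
    return [tuple(merge_str.split()) for merge_str in lines[1:-1]]


# Hypothetical miniature merges file: a header followed by two merge rules.
sample_bpe = "#version: 0.2\n\u0120 t\n\u0120 a\n"
merges = parse_merges(sample_bpe)
```

Note that the slice assumes the file ends with a newline; without one, the last merge rule would be dropped.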

Post-Processing Functions (gpt2_decode.py:49-62)

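import re

def stardard_tokenize(sent):
    # Put spaces around every non-word character, then collapse repeated whitespace.
    sent = ' '.join(re.split(r'(\W)', sent))
    sent = sent.split()
    sent = ' '.join(sent)
    return sent

def post_process(sent, is_tokenize, is_lower):
    # Lowercasing runs before tokenization; both steps are optional.
    if is_lower:
        sent = sent.lower()
    if is_tokenize:
        sent = stardard_tokenize(sent)
    return sent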
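
The effect of the two post-processing switches is easiest to see on a short input. This sketch re-implements the same logic under the hypothetical names `tokenize_sketch` and `post_process_sketch` so that it runs on its own.

```python
import re


def tokenize_sketch(sent):
    """Separate every non-word character with spaces and collapse whitespace."""
    return " ".join(" ".join(re.split(r"(\W)", sent)).split())


def post_process_sketch(sent, is_tokenize, is_lower):
    """Optionally lowercase, then optionally tokenize."""
    if is_lower:
        sent = sent.lower()
    if is_tokenize:
        sent = tokenize_sketch(sent)
    return sent


both = post_process_sketch("Hello, World!", True, True)       # "hello , world !"
hyphen = post_process_sketch("price-range", True, False)      # "price - range"
untouched = post_process_sketch("Hello, World!", False, False)
```

Because `\W` does not match the underscore, identifiers such as `price_range` stay in one piece, while hyphenated words are split into three tokens.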

Reference Formatting Logic

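The main script groups references by context (or by unique ID if --ref_unique_file is provided). Each prediction is decoded from the beam search output, cut at the first <|endoftext|> marker and then at the first blank line, and stripped of surrounding whitespace: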

refer_dict[_key]['sample'] = enc.decode(_pred_tokens).split('<|endoftext|>')[0].split('\n\n')[0].strip()
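
The truncation chain in the line above can be tried in isolation on an already decoded string. The helper name `truncate_prediction` is hypothetical and is used here only to make the three steps explicit.

```python
def truncate_prediction(decoded_text):
    """Keep only the first generated segment of a decoded beam search output."""
    # 1. Discard everything from the first end-of-text marker onward.
    text = decoded_text.split("<|endoftext|>")[0]
    # 2. Discard everything after the first blank line.
    text = text.split("\n\n")[0]
    # 3. Remove leading and trailing whitespace.
    return text.strip()


raw = " The Eagle is cheap. \n\nextra text<|endoftext|>more text"
clean = truncate_prediction(raw)  # "The Eagle is cheap."
```

A string that contains neither marker passes through unchanged apart from the final `strip()`.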

For E2E format, references are written as groups separated by blank lines. For WebNLG/DART format, references are distributed across ref_num separate files (reference0, reference1, etc.).
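
The two layouts can be sketched with in-memory lists instead of real files. The function names are hypothetical, and the handling of contexts with fewer than `ref_num` references (reusing the first reference) is an assumption made for this sketch; check the script for the actual padding behavior.

```python
def format_e2e(groups):
    """Return the text of one E2E-style file: groups separated by blank lines."""
    lines = []
    for refs in groups:
        lines.extend(refs)
        lines.append("")  # blank line closes the group
    return "\n".join(lines) + "\n"


def format_multi_file(groups, ref_num):
    """Return ref_num line lists, one per reference file (reference0, reference1, ...)."""
    files = [[] for _ in range(ref_num)]
    for refs in groups:
        for fid in range(ref_num):
            # Assumption: pad with the first reference when a group is short,
            # so that every file has exactly one line per context.
            files[fid].append(refs[fid] if fid < len(refs) else refs[0])
    return files


groups = [["ref a1", "ref a2"], ["ref b1"]]
e2e_text = format_e2e(groups)
multi = format_multi_file(groups, ref_num=2)
```

In the multi-file layout, line i of every reference file corresponds to line i of the prediction file, which is why each file must contain one line per context.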

Input / Output

Direction	Description
Input	Beam search JSONL (`{"id": int, "predict": [token_ids]}`); original data JSONL (for extracting reference completions); BPE vocabulary files (`encoder.json`, `vocab.bpe`)
Output	Plain-text prediction file (one hypothesis per line); plain-text reference file(s). E2E: a single file with grouped references separated by blank lines. WebNLG/DART: a directory with `reference0` through `reference{N-1}`
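
As a rough illustration of how the two JSONL inputs relate, the sketch below groups reference completions by context and attaches the prediction to each group. The field names `context` and `completion`, the use of `id` as a zero-based index into the original data, and the function name `group_references` are all assumptions made for this example.

```python
import json


def group_references(data_lines, sample_lines):
    """Group completions that share a context and keep one prediction per group."""
    records = [json.loads(line) for line in data_lines]
    groups = {}
    for line in sample_lines:
        sample = json.loads(line)
        # Assumption: "id" indexes the matching record in the original data.
        record = records[sample["id"]]
        key = tuple(record["context"])  # lists are unhashable, tuples are not
        entry = groups.setdefault(key, {"references": [], "predict": sample["predict"]})
        entry["references"].append(record["completion"])
    return groups


data_lines = [
    '{"context": [1, 2], "completion": [3]}',
    '{"context": [1, 2], "completion": [4]}',
    '{"context": [5], "completion": [6]}',
]
sample_lines = [
    '{"id": 0, "predict": [9]}',
    '{"id": 1, "predict": [9]}',
    '{"id": 2, "predict": [8]}',
]
grouped = group_references(data_lines, sample_lines)
```

Here the first two records share a context, so they form one group with two references, which matches the one-hypothesis-per-context layout of the prediction file.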

Metadata

Field	Value
Source	microsoft/LoRA
Type	API Doc
Last Updated	2026-02-10
