Implementation:Microsoft LoRA GPT2 Decode Script
Overview
GPT2_Decode_Script converts beam search output (JSONL of token IDs) back into human-readable prediction and reference text files for evaluation. It uses the GPT-2 BPE decoder to map token IDs to text, applies post-processing (truncation, optional tokenization and lowercasing), and formats references according to dataset-specific conventions (E2E vs. WebNLG/DART).
Type
API Doc
Source
examples/NLG/src/gpt2_decode.py (lines 65-163)
examples/NLG/src/encoder.py (lines 46-132)
CLI Signature
python src/gpt2_decode.py \
--vocab <vocab_dir> --sample_file <beam_output.jsonl> \
--input_file <original_data.jsonl> \
--output_ref_file <ref_path> --output_pred_file <pred_path> \
--ref_type <e2e|webnlg|dart> --ref_num <N> \
[--tokenize] [--lower] [--filter <all|seen|unseen>] \
[--ref_unique_file <unique_id_file>]
Argument reference:
| Argument | Type | Default | Description |
|---|---|---|---|
| --vocab | str | None | Directory containing encoder.json and vocab.bpe |
| --sample_file | str | None | Beam search output JSONL ({"id": int, "predict": [...]}) |
| --input_file | str | None | Original BPE-encoded data JSONL (for references) |
| --output_ref_file | str | None | Output path for reference file(s) |
| --output_pred_file | str | None | Output path for prediction file |
| --ref_type | str | e2e | Reference format: e2e, webnlg, or dart |
| --ref_num | int | 4 | Number of reference files (for webnlg/dart) |
| --tokenize | flag | False | Apply word tokenization to outputs |
| --lower | flag | False | Lowercase all outputs |
| --filter | str | all | WebNLG category filter: all, seen, unseen |
| --ref_unique_file | str | None | Optional unique ID file for reference grouping |
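The argument table above could be mirrored by an argparse parser along the following lines. This is a minimal sketch reconstructed from the table, not the parser defined in gpt2_decode.py; the `build_parser` helper name and the use of `choices` are assumptions made for illustration.

```python
import argparse

def build_parser():
    # Hypothetical helper: mirrors the argument table, not copied from the script.
    parser = argparse.ArgumentParser(description="Decode beam search output (sketch)")
    parser.add_argument("--vocab", type=str, default=None)
    parser.add_argument("--sample_file", type=str, default=None)
    parser.add_argument("--input_file", type=str, default=None)
    parser.add_argument("--output_ref_file", type=str, default=None)
    parser.add_argument("--output_pred_file", type=str, default=None)
    parser.add_argument("--ref_unique_file", type=str, default=None)
    # Restricting values with `choices` is an assumption of this sketch.
    parser.add_argument("--ref_type", type=str, default="e2e",
                        choices=["e2e", "webnlg", "dart"])
    parser.add_argument("--ref_num", type=int, default=4)
    # Flags default to False and become True when present on the command line.
    parser.add_argument("--tokenize", action="store_true")
    parser.add_argument("--lower", action="store_true")
    parser.add_argument("--filter", type=str, default="all",
                        choices=["all", "seen", "unseen"])
    return parser

# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--ref_type", "webnlg", "--tokenize"])
```

Arguments that are not passed keep the defaults listed in the table, so here `args.ref_num` stays at 4 and `args.lower` stays False.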
Key Internal Components
Encoder.decode (encoder.py:117-120)
class Encoder:
    def decode(self, tokens):
        text = ''.join([self.decoder[token] for token in tokens])
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors='replace')
        return text
Maps a list of integer token IDs back to a UTF-8 string through the reverse BPE lookup table and byte-to-unicode mapping.
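The two lookups in decode can be illustrated with a toy vocabulary. The sketch below rebuilds a byte-to-unicode table in the style used by GPT-2 and inverts it; the three-entry `toy_decoder` table and the standalone `decode` function are hypothetical stand-ins for the real encoder.json contents and the Encoder method.

```python
def bytes_to_unicode():
    # Printable bytes keep their own code point; every other byte is
    # shifted to an unused code point starting at 256.
    bs = list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD)) + list(range(0xAE, 0x100))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

# Inverse table: printable stand-in character -> original byte value.
byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}

# Hypothetical three-token vocabulary (id -> BPE string).
# "\u0120" is the stand-in character for the space byte (0x20).
toy_decoder = {0: "Hello", 1: "\u0120world", 2: "!"}

def decode(tokens):
    # Step 1: token IDs -> concatenated BPE strings.
    text = "".join(toy_decoder[t] for t in tokens)
    # Step 2: stand-in characters -> raw bytes -> UTF-8 text.
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors="replace")

decoded = decode([0, 1, 2])  # "Hello world!"
```

Because of `errors='replace'`, a token sequence that ends in the middle of a multi-byte character yields a replacement character instead of raising an exception.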
encoder.get_encoder (encoder.py:123-132)
def get_encoder(models_dir):
    with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(encoder=encoder, bpe_merges=bpe_merges)
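The `[1:-1]` slice in get_encoder can be seen in action on a tiny pair of files. The sketch below assumes that vocab.bpe starts with a version header line and ends with a trailing newline, which is what the slice appears designed to strip; the `load_vocab_files` helper and the toy file contents are hypothetical, and the helper returns the raw data rather than building an Encoder.

```python
import json
import os
import tempfile

def load_vocab_files(models_dir):
    # Hypothetical helper: same parsing steps as get_encoder, minus the Encoder object.
    with open(os.path.join(models_dir, "encoder.json"), "r") as f:
        encoder = json.load(f)
    with open(os.path.join(models_dir, "vocab.bpe"), "r", encoding="utf-8") as f:
        bpe_data = f.read()
    # [1:-1] drops the header line and the empty string after the final newline.
    bpe_merges = [tuple(line.split()) for line in bpe_data.split("\n")[1:-1]]
    return encoder, bpe_merges

with tempfile.TemporaryDirectory() as tmp:
    # Toy vocabulary and merge list, written in the expected layout.
    with open(os.path.join(tmp, "encoder.json"), "w") as f:
        json.dump({"h": 0, "e": 1, "he": 2}, f)
    with open(os.path.join(tmp, "vocab.bpe"), "w", encoding="utf-8") as f:
        f.write("#version: 0.2\nh e\nhe l\n")
    encoder, bpe_merges = load_vocab_files(tmp)
```

Each merge rule becomes a tuple of two symbols, in file order, so earlier lines in vocab.bpe correspond to higher-priority merges.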
Post-Processing Functions (gpt2_decode.py:49-62)
def stardard_tokenize(sent):
    sent = ' '.join(re.split('(\W)', sent))
    sent = sent.split()
    sent = ' '.join(sent)
    return sent

def post_process(sent, is_tokenize, is_lower):
    if is_lower:
        sent = sent.lower()
    if is_tokenize:
        sent = stardard_tokenize(sent)
    return sent
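To make the effect of these two functions concrete, the sketch below reimplements them in standalone form (using a raw-string regex pattern) and runs them on a made-up sentence. The input sentence is an assumption chosen for illustration, not taken from any dataset.

```python
import re

def stardard_tokenize(sent):
    # Split on every non-word character, keeping the separators,
    # then collapse all runs of whitespace to single spaces.
    sent = " ".join(re.split(r"(\W)", sent))
    return " ".join(sent.split())

def post_process(sent, is_tokenize, is_lower):
    # Lowercasing happens before tokenization, as in the listing above.
    if is_lower:
        sent = sent.lower()
    if is_tokenize:
        sent = stardard_tokenize(sent)
    return sent

result = post_process("The Eagle is a coffee shop, rated 5-star.", True, True)
# result == "the eagle is a coffee shop , rated 5 - star ."
```

Punctuation and hyphens become separate space-delimited tokens, so references and predictions must be processed with the same flags for word-overlap metrics to be comparable.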
Reference Formatting Logic
The main script groups references by context (or by unique ID if --ref_unique_file is provided). Predictions are decoded from beam search output:
refer_dict[_key]['sample'] = enc.decode(_pred_tokens).split('<|endoftext|>')[0].split('\n\n')[0].strip()
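The chained splits in the line above truncate the decoded text in two stages. A minimal sketch on an already-decoded string follows; the `truncate_prediction` helper name and the sample text are hypothetical.

```python
def truncate_prediction(decoded):
    # Keep only the text before the first end-of-text marker,
    # then only the text before the first blank line, then trim whitespace.
    return decoded.split("<|endoftext|>")[0].split("\n\n")[0].strip()

raw = " The Eagle is a coffee shop.\n\nExtra text<|endoftext|>padding"
sample = truncate_prediction(raw)  # "The Eagle is a coffee shop."
```

A string containing neither marker passes through unchanged apart from the whitespace trim.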
For E2E format, references are written as groups separated by blank lines. For WebNLG/DART format, references are distributed across ref_num separate files (reference0, reference1, etc.).
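The two layouts can be sketched with in-memory data. The functions below are illustrative only: the `format_e2e` and `format_multi_file` names are hypothetical, and writing an empty line for a context that has fewer than ref_num references is an assumption of this sketch rather than something stated above.

```python
def format_e2e(groups):
    # One reference per line; a blank line closes each group.
    lines = []
    for refs in groups:
        lines.extend(refs)
        lines.append("")
    return "\n".join(lines) + "\n"

def format_multi_file(groups, ref_num):
    # files[i] holds the i-th reference of every group, one line per group,
    # so line k of every file refers to the same context.
    files = [[] for _ in range(ref_num)]
    for refs in groups:
        for i in range(ref_num):
            # Assumption: pad with an empty line when a group has too few references.
            files[i].append(refs[i] if i < len(refs) else "")
    return files

# Two contexts: the first has two references, the second has one.
groups = [["ref a1", "ref a2"], ["ref b1"]]
e2e_text = format_e2e(groups)
multi = format_multi_file(groups, ref_num=2)
```

In the multi-file layout the files stay line-aligned with the prediction file, which is why shorter groups need placeholder lines.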
Input / Output
| Direction | Description |
|---|---|
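| Input | Beam search output JSONL (--sample_file), original BPE-encoded data JSONL (--input_file), and a vocabulary directory containing encoder.json and vocab.bpe (--vocab) |
| Output | Prediction text file (--output_pred_file) and reference file(s) (--output_ref_file): blank-line-separated groups for E2E, or ref_num separate files for WebNLG/DART |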
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Type | API Doc |
| Last Updated | 2026-02-10 |