Implementation:Microsoft LoRA GPT2 Decode Script

From Leeroopedia


Overview

GPT2_Decode_Script converts beam search output (JSONL of token IDs) back into human-readable prediction and reference text files for evaluation. It uses the GPT-2 BPE decoder to map token IDs to text, applies post-processing (truncation, optional tokenization and lowercasing), and formats references according to dataset-specific conventions (E2E vs. WebNLG/DART).

Type

API Doc

Source

  • examples/NLG/src/gpt2_decode.py (lines 65-163)
  • examples/NLG/src/encoder.py (lines 46-132)

CLI Signature

python src/gpt2_decode.py \
    --vocab <vocab_dir> --sample_file <beam_output.jsonl> \
    --input_file <original_data.jsonl> \
    --output_ref_file <ref_path> --output_pred_file <pred_path> \
    --ref_type <e2e|webnlg|dart> --ref_num <N> \
    [--tokenize] [--lower] [--filter <all|seen|unseen>] \
    [--ref_unique_file <unique_id_file>]

Argument reference:

Argument            Type   Default   Description
--vocab             str    None      Directory containing encoder.json and vocab.bpe
--sample_file       str    None      Beam search output JSONL ({"id": int, "predict": [...]})
--input_file        str    None      Original BPE-encoded data JSONL (for references)
--output_ref_file   str    None      Output path for reference file(s)
--output_pred_file  str    None      Output path for prediction file
--ref_type          str    e2e       Reference format: e2e, webnlg, or dart
--ref_num           int    4         Number of reference files (for webnlg/dart)
--tokenize          flag   False     Apply word tokenization to outputs
--lower             flag   False     Lowercase all outputs
--filter            str    all       WebNLG category filter: all, seen, unseen
--ref_unique_file   str    None      Optional unique ID file for reference grouping
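
The argument table can be read as an argparse declaration. The following is a minimal sketch reconstructed from the table only, not a copy of the parser in gpt2_decode.py; the helper name build_parser is hypothetical, and the real script may declare help strings or extra options not shown here.

```python
import argparse


def build_parser():
    # Hypothetical helper: mirrors the argument table, not the script's own code.
    p = argparse.ArgumentParser(description="Decode beam search output to text")
    p.add_argument("--vocab", type=str, default=None)
    p.add_argument("--sample_file", type=str, default=None)
    p.add_argument("--input_file", type=str, default=None)
    p.add_argument("--output_ref_file", type=str, default=None)
    p.add_argument("--output_pred_file", type=str, default=None)
    p.add_argument("--ref_type", type=str, default="e2e")
    p.add_argument("--ref_num", type=int, default=4)
    # Flags default to False and become True when present on the command line.
    p.add_argument("--tokenize", action="store_true")
    p.add_argument("--lower", action="store_true")
    p.add_argument("--filter", type=str, default="all")
    p.add_argument("--ref_unique_file", type=str, default=None)
    return p


args = build_parser().parse_args(
    ["--vocab", "vocab", "--sample_file", "beam.jsonl", "--ref_type", "webnlg", "--lower"]
)
print(args.ref_type, args.ref_num, args.lower, args.tokenize, args.filter)
# prints: webnlg 4 True False all
```

Options that are not passed keep the defaults from the table, so a WebNLG run only needs to override --ref_type and, if required, --ref_num.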

Key Internal Components

Encoder.decode (encoder.py:117-120)

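class Encoder:
    def decode(self, tokens):
        # Look up each token ID's BPE string (byte-level unicode form) and concatenate.
        text = ''.join([self.decoder[token] for token in tokens])
        # Map every character back to its raw byte, then decode the bytes as UTF-8;
        # invalid sequences become U+FFFD instead of raising an error.
        text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', errors='replace')
        return text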

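Maps a list of integer token IDs back to a Python string in two steps: a reverse vocabulary lookup (self.decoder, ID to BPE string) followed by the inverse of the byte-to-unicode mapping (self.byte_decoder), whose output bytes are decoded as UTF-8. Invalid byte sequences are replaced rather than raising an exception.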
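
The two-step decode can be illustrated with a toy vocabulary. The sketch below rebuilds a byte-to-unicode table in the style used by GPT-2's byte-level BPE and inverts it; the three-entry vocabulary and its IDs are invented for illustration and do not correspond to the real encoder.json.

```python
def bytes_to_unicode():
    """Byte -> printable unicode character table in the style of GPT-2's BPE."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(0xA1, 0xAC + 1))
        + list(range(0xAE, 0xFF + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes (including space) are shifted above U+00FF.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Inverse table: unicode character -> raw byte value.
byte_decoder = {ch: b for b, ch in bytes_to_unicode().items()}

# Toy vocabulary with hypothetical IDs; "\u0120" is how the table renders a space.
toy_decoder = {0: "Hello", 1: "\u0120world", 2: "!"}


def toy_decode(tokens):
    text = "".join(toy_decoder[t] for t in tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors="replace")


decoded = toy_decode([0, 1, 2])
print(decoded)  # Hello world!
```

The leading "\u0120" in the second token shows why the byte mapping matters: whitespace is stored inside tokens as a visible placeholder character and only becomes a real space after the byte-level step.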

encoder.get_encoder (encoder.py:123-132)

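import json
import os

def get_encoder(models_dir):
    # encoder.json maps BPE token strings to integer IDs.
    with open(os.path.join(models_dir, 'encoder.json'), 'r') as f:
        encoder = json.load(f)
    # vocab.bpe holds one merge rule per line after a header line.
    with open(os.path.join(models_dir, 'vocab.bpe'), 'r', encoding="utf-8") as f:
        bpe_data = f.read()
    # [1:-1] drops the header line and the empty string left by the trailing newline.
    bpe_merges = [tuple(merge_str.split()) for merge_str in bpe_data.split('\n')[1:-1]]
    return Encoder(encoder=encoder, bpe_merges=bpe_merges)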
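
The [1:-1] slice in get_encoder is easier to follow on a small input. The sketch below applies the same parsing expression to an in-memory string; the three merge rules are made up for illustration and the header line format is an assumption about what vocab.bpe typically starts with.

```python
# Hypothetical miniature vocab.bpe content: one header line, then merge rules.
bpe_data = "#version: 0.2\n\u0120 t\nh e\n\u0120t he\n"

lines = bpe_data.split("\n")
# lines[0] is the header; lines[-1] is the empty string after the final newline.
bpe_merges = [tuple(merge_str.split()) for merge_str in lines[1:-1]]

print(len(lines), bpe_merges)
```

If the file did not end with a newline, the same slice would silently drop the last merge rule, so the expression relies on the file's trailing newline.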

Post-Processing Functions (gpt2_decode.py:49-62)

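import re

def stardard_tokenize(sent):
    # Split on every non-word character, keeping the separators as tokens.
    sent = ' '.join(re.split(r'(\W)', sent))
    # Collapse the resulting runs of whitespace into single spaces.
    sent = sent.split()
    sent = ' '.join(sent)
    return sent

def post_process(sent, is_tokenize, is_lower):
    if is_lower:
        sent = sent.lower()
    if is_tokenize:
        sent = stardard_tokenize(sent)
    return sent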
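
The effect of the two flags is easiest to see on a short sentence. The sketch below repeats the two functions in condensed form so that it runs on its own and applies them to an invented example string.

```python
import re


def stardard_tokenize(sent):
    # Keep punctuation as separate tokens, then normalise whitespace.
    sent = ' '.join(re.split(r'(\W)', sent))
    return ' '.join(sent.split())


def post_process(sent, is_tokenize, is_lower):
    if is_lower:
        sent = sent.lower()
    if is_tokenize:
        sent = stardard_tokenize(sent)
    return sent


raw = "Hello, World!"
plain = post_process(raw, is_tokenize=False, is_lower=False)
lowered = post_process(raw, is_tokenize=False, is_lower=True)
both = post_process(raw, is_tokenize=True, is_lower=True)

print(plain)    # Hello, World!
print(lowered)  # hello, world!
print(both)     # hello , world !
```

Because every non-word character becomes its own token, contractions are split as well: "isn't" comes out as "isn ' t".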

Reference Formatting Logic

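The main script groups references by context (or by unique ID if --ref_unique_file is provided). Each prediction is decoded from the beam search output, cut at the first <|endoftext|> marker, cut again at the first blank line, and stripped of surrounding whitespace: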

refer_dict[_key]['sample'] = enc.decode(_pred_tokens).split('<|endoftext|>')[0].split('\n\n')[0].strip()
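
The truncation chain in the line above can be tried on a decoded string directly. In this sketch the helper name truncate_prediction and the sample text are hypothetical; only the chain of string operations is taken from the script.

```python
def truncate_prediction(decoded):
    # Same chain as in the script: end-of-text marker, then blank line, then strip.
    return decoded.split('<|endoftext|>')[0].split('\n\n')[0].strip()


raw = " The Eagle is a cheap pub.\n\nThe Eagle<|endoftext|>unrelated text"
sample = truncate_prediction(raw)
print(sample)  # The Eagle is a cheap pub.

# A string without either marker is only stripped.
untouched = truncate_prediction("  no markers here ")
print(untouched)  # no markers here
```

Anything the model generates after the end-of-text marker or after a paragraph break is discarded, so only the first paragraph of a generation is ever evaluated.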

For E2E format, references are written as groups separated by blank lines. For WebNLG/DART format, references are distributed across ref_num separate files (reference0, reference1, etc.).
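
The two layouts can be sketched with in-memory buffers. This is an illustration under stated assumptions rather than the script's own writer code: the function names are hypothetical, the 'references' key is assumed (only 'sample' appears in the excerpt above), and padding missing references with an empty line in the WebNLG/DART layout is an assumption about how uneven reference counts are handled.

```python
import io

# Hypothetical grouped data: two contexts with different numbers of references.
refer_dict = {
    "ctx_a": {"references": ["ref a1", "ref a2"], "sample": "pred a"},
    "ctx_b": {"references": ["ref b1"], "sample": "pred b"},
}


def format_e2e(refer_dict):
    ref_out, pred_out = io.StringIO(), io.StringIO()
    for entry in refer_dict.values():
        for ref in entry["references"]:
            ref_out.write(ref + "\n")
        ref_out.write("\n")  # blank line closes each reference group
        pred_out.write(entry["sample"] + "\n")
    return ref_out.getvalue(), pred_out.getvalue()


def format_multi_file(refer_dict, ref_num):
    refs = [io.StringIO() for _ in range(ref_num)]
    pred_out = io.StringIO()
    for entry in refer_dict.values():
        for i in range(ref_num):
            # Assumption: pad with an empty line when fewer references exist.
            has_ref = i < len(entry["references"])
            refs[i].write((entry["references"][i] if has_ref else "") + "\n")
        pred_out.write(entry["sample"] + "\n")
    return [r.getvalue() for r in refs], pred_out.getvalue()


e2e_refs, e2e_preds = format_e2e(refer_dict)
multi_refs, multi_preds = format_multi_file(refer_dict, ref_num=2)
print(repr(e2e_refs))
print([repr(r) for r in multi_refs])
```

In the multi-file layout every reference file has exactly one line per prediction, so line k of each reference file aligns with line k of the prediction file.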

Input / Output

Direction Description
Input
  • Beam search JSONL ({"id": int, "predict": [token_ids]})
  • Original data JSONL (for extracting reference completions)
  • BPE vocabulary files (encoder.json, vocab.bpe)
Output
  • Plain text prediction file (one hypothesis per line)
  • Plain text reference file(s):
    • E2E: single file with grouped references separated by blank lines
    • WebNLG/DART: directory with reference0 through reference{N-1}
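
Reading the beam search input can be sketched as follows. The record shape {"id": int, "predict": [token_ids]} comes from the description above; the helper name read_samples is hypothetical, and the token IDs in the example are arbitrary numbers.

```python
import io
import json


def read_samples(fp):
    """Hypothetical reader: maps each record's id to its predicted token IDs."""
    samples = {}
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        samples[record["id"]] = record["predict"]
    return samples


# Stand-in for an opened --sample_file.
fake_file = io.StringIO(
    '{"id": 0, "predict": [15496, 995]}\n'
    '{"id": 1, "predict": [464]}\n'
)
samples = read_samples(fake_file)
print(samples)
```

Each list of token IDs would then be passed to the encoder's decode method to obtain the hypothesis text.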

Metadata

Field Value
Source microsoft/LoRA
Type API Doc
Last Updated 2026-02-10
