Implementation:Microsoft LoRA GPT2 Encode Pipeline

From Leeroopedia


Overview

GPT2_Encode_Pipeline implements the two-stage dataset preparation process (format conversion followed by BPE encoding) that converts raw NLG datasets (E2E, DART, WebNLG) into BPE-encoded JSONL files consumed by the GPT-2 LoRA fine-tuning pipeline. It consists of three format conversion scripts, one BPE encoding script, and an orchestrating shell script.

Type

API Doc

Source

  • examples/NLG/src/format_converting_e2e.py (lines 1-20)
  • examples/NLG/src/format_converting_dart.py (lines 1-43)
  • examples/NLG/src/format_converting_webnlg.py (lines 1-68)
  • examples/NLG/src/gpt2_encode.py (lines 1-70)
  • examples/NLG/create_datasets.sh (lines 1-44)

Signatures

Format Conversion Scripts

Each format converter takes a raw input file and produces a JSONL output file with {"context": str, "completion": str} on each line.

E2E Format Converter

python src/format_converting_e2e.py <input_txt> <output_jsonl>

Reads text in which each line has the form context||completion and writes JSONL. Each line is split on ||, with the first part as context and the second as completion.
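
The splitting step can be sketched as follows. This is a minimal illustration, not the repository's script: the helper names e2e_line_to_record and convert_e2e are hypothetical, the sample line is made up, and trimming whitespace around each part is an assumption.

```python
import json

def e2e_line_to_record(line: str) -> dict:
    """Split one 'context||completion' line into a record (hypothetical helper)."""
    # Split on the first '||' only, so a completion containing '||' stays whole (assumption).
    context, completion = line.strip().split("||", 1)
    # Trimming surrounding whitespace is an assumption, not confirmed by the source.
    return {"context": context.strip(), "completion": completion.strip()}

def convert_e2e(lines):
    """Yield one JSON string per non-blank input line."""
    for line in lines:
        if line.strip():
            yield json.dumps(e2e_line_to_record(line))

# Made-up sample line in the context||completion shape.
sample = ["name : The Vaults | food : Japanese||The Vaults serves Japanese food.\n"]
for json_line in convert_e2e(sample):
    print(json_line)
```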

DART Format Converter

python src/format_converting_dart.py <input_json> <output_jsonl>

Reads DART JSON (array of objects with tripleset and annotations). Triples are linearized as subject : relation : object joined by |. Each annotation produces a separate JSONL line.
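
A rough sketch of the linearization described above, under stated assumptions: each triple is taken to be a three-element list, the annotation key holding the reference sentence is assumed to be "text", and the spacing around the : and | separators is a guess. The function names are hypothetical.

```python
def linearize_triples(tripleset):
    """Render triples as 'subject : relation : object', joined by ' | '."""
    return " | ".join(" : ".join(str(part) for part in triple) for triple in tripleset)

def dart_to_records(examples):
    """Yield one record per annotation, so an example with N annotations gives N lines."""
    for example in examples:
        context = linearize_triples(example["tripleset"])
        for annotation in example["annotations"]:
            # The 'text' key is an assumption about the annotation layout.
            yield {"context": context, "completion": annotation["text"]}

# Made-up example in the assumed layout.
examples = [{
    "tripleset": [["Mars", "ORBITS", "Sun"], ["Mars", "TYPE", "planet"]],
    "annotations": [{"text": "Mars is a planet orbiting the Sun."},
                    {"text": "The planet Mars orbits the Sun."}],
}]
for record in dart_to_records(examples):
    print(record)
```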

WebNLG Format Converter

python src/format_converting_webnlg.py <input_json> <output_jsonl>

Reads WebNLG JSON (dict with entries). Each entry has modifiedtripleset (list of {subject, property, object}) and lexicalisations. Triples are linearized as subject : property : object joined by |. Only lexicalisations with "comment": "good" are included. A boolean cate field is added indicating whether the category is among 10 seen categories.
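
The filtering and the cate flag might look roughly like this. It is a sketch, not the script itself: the "category" and "lex" keys are assumptions about the entry layout, the function name is hypothetical, and the set of seen categories is passed in by the caller because the source does not list them.

```python
def webnlg_entry_to_records(entry, seen_categories):
    """Turn one entry into records, keeping only lexicalisations marked 'good'."""
    context = " | ".join(
        f"{t['subject']} : {t['property']} : {t['object']}"
        for t in entry["modifiedtripleset"]
    )
    # 'category' is an assumed key name; cate is True for seen categories.
    cate = entry["category"] in seen_categories
    return [
        {"context": context, "completion": lex["lex"], "cate": cate}  # 'lex' key assumed
        for lex in entry["lexicalisations"]
        if lex["comment"] == "good"
    ]

# Made-up entry; the seen set here is illustrative, not the script's actual list.
entry = {
    "category": "Airport",
    "modifiedtripleset": [{"subject": "A", "property": "location", "object": "B"}],
    "lexicalisations": [{"comment": "good", "lex": "A is located in B."},
                        {"comment": "bad", "lex": "B A located."}],
}
print(webnlg_entry_to_records(entry, seen_categories={"Airport", "City"}))
```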

BPE Encoding Script

python src/gpt2_encode.py --vocab <vocab_dir> --input <input_jsonl> --output <output_jsonl> [--add_bos] [--add_eos]

Arguments:

Argument    Type    Description
--vocab     str     Path to directory containing encoder.json and vocab.bpe
--input     str     Path to input JSONL (text context/completion)
--output    str     Path to output JSONL (BPE token IDs)
--add_bos   flag    Append BOS token (50256) to context
--add_eos   flag    Append EOS token (50256) to completion

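The encoding logic calls encoder.get_encoder to load the BPE encoder from the vocabulary directory, then converts each context and completion to token IDs. A space is prepended to the completion before encoding, and 50256 (GPT-2's <|endoftext|> token) serves as both BOS and EOS: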

enc = encoder.get_encoder(args.vocab)
context_bpes, _ = enc.encode(context)
context_bpes += [50256] if args.add_bos else []
completion_bpes, _ = enc.encode(' ' + completion)
completion_bpes += [50256] if args.add_eos else []
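
To make the effect of the two flags concrete, here is a self-contained sketch that swaps in a toy encoder. ToyEncoder and encode_record are hypothetical names; the toy maps characters to code points rather than running real BPE, and only its (ids, extra) return shape mirrors the excerpt above. The output key names "context" and "completion" are also an assumption.

```python
import json

BOS_ID = 50256  # BOS and EOS share this ID, per the argument descriptions
EOS_ID = 50256

class ToyEncoder:
    """Hypothetical stand-in for the object returned by encoder.get_encoder."""
    def encode(self, text):
        # Not real BPE: one ID per character, returned as (ids, extra).
        return [ord(ch) for ch in text], None

def encode_record(enc, record, add_bos=False, add_eos=False):
    """Encode one {'context', 'completion'} record into token-ID lists."""
    context_ids, _ = enc.encode(record["context"])
    if add_bos:
        context_ids = context_ids + [BOS_ID]
    # The leading space mirrors the ' ' + completion step in the excerpt.
    completion_ids, _ = enc.encode(" " + record["completion"])
    if add_eos:
        completion_ids = completion_ids + [EOS_ID]
    return {"context": context_ids, "completion": completion_ids}

encoded = encode_record(ToyEncoder(), {"context": "ab", "completion": "c"},
                        add_bos=True, add_eos=True)
print(json.dumps(encoded))
```

With both flags set, the context ends with 50256 and the completion both starts with the ID of the prepended space and ends with 50256.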

Orchestration Script

bash create_datasets.sh

This shell script runs both stages for all three datasets (E2E, WebNLG, DART) across all splits (train, valid, test). For example, the E2E train split is processed in two steps:

python src/format_converting_e2e.py data/e2e/train.txt data/e2e/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input data/e2e/train_formatted.jsonl --output data/e2e/train.jsonl --add_bos --add_eos
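
The same two-step pattern can be written out per split. The sketch below only builds and prints the command lines; it does not run them. The helper e2e_commands is hypothetical, and applying the _formatted.jsonl naming and both flags to the valid and test splits is an assumption extrapolated from the train commands above.

```python
def e2e_commands(split, data_dir="data/e2e", vocab_dir="vocab"):
    """Build the format-conversion and encoding commands for one E2E split."""
    raw = f"{data_dir}/{split}.txt"
    formatted = f"{data_dir}/{split}_formatted.jsonl"  # naming assumed for valid/test
    encoded = f"{data_dir}/{split}.jsonl"
    return [
        ["python", "src/format_converting_e2e.py", raw, formatted],
        ["python", "src/gpt2_encode.py", "--vocab", vocab_dir,
         "--input", formatted, "--output", encoded, "--add_bos", "--add_eos"],
    ]

for split in ("train", "valid", "test"):
    for command in e2e_commands(split):
        print(" ".join(command))
```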

Input / Output

Direction   Description
Input       Raw dataset files:
  • E2E: data/e2e/{train,valid,test}.txt
  • DART: data/dart/dart-v1.1.1-full-{train,dev,test}.json
  • WebNLG: data/webnlg_challenge_2017/{train,dev,test}.json
Output      BPE-encoded JSONL files with integer token ID arrays:
  • data/e2e/{train,valid,test}.jsonl
  • data/dart/{train,valid,test}.jsonl
  • data/webnlg_challenge_2017/{train,valid,test}.jsonl
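
A quick sanity check on an output line could look like the sketch below. It assumes the encoded records keep the keys "context" and "completion", which the source implies but does not state; the function name is hypothetical.

```python
import json

def is_valid_encoded_line(line: str) -> bool:
    """Return True if both fields are lists made up only of integers."""
    record = json.loads(line)
    return all(
        isinstance(record.get(key), list)
        # bool is a subclass of int in Python, so exclude it explicitly.
        and all(isinstance(tok, int) and not isinstance(tok, bool) for tok in record[key])
        for key in ("context", "completion")
    )

print(is_valid_encoded_line('{"context": [1, 2, 50256], "completion": [3, 50256]}'))
print(is_valid_encoded_line('{"context": ["text"], "completion": []}'))
```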

Metadata

Field          Value
Source         microsoft/LoRA
Type           API Doc
Last Updated   2026-02-10
