Implementation:Microsoft LoRA GPT2 Encode Pipeline

Overview

GPT2_Encode_Pipeline is the concrete implementation of the two-stage dataset preparation process (format conversion followed by BPE encoding) that converts raw NLG datasets (E2E, DART, WebNLG) into BPE-encoded JSONL files consumed by the GPT-2 LoRA fine-tuning pipeline. It consists of three format conversion scripts, one BPE encoding script, and an orchestrating shell script.

Type

API Doc

Source

examples/NLG/src/format_converting_e2e.py (lines 1-20)
examples/NLG/src/format_converting_dart.py (lines 1-43)
examples/NLG/src/format_converting_webnlg.py (lines 1-68)
examples/NLG/src/gpt2_encode.py (lines 1-70)
examples/NLG/create_datasets.sh (lines 1-44)

Signatures

Format Conversion Scripts

Each format converter takes a raw input file and produces a JSONL output file with {"context": str, "completion": str} on each line.

E2E Format Converter

python src/format_converting_e2e.py <input_txt> <output_jsonl>

Reads pipe-delimited text (context || completion) and writes JSONL. Each line is split on ||, with the first part as context and the second as completion.
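
The split described above can be sketched as follows. This is a minimal illustration rather than the script itself: the helper name `e2e_line_to_record` is hypothetical, the whitespace trimming is an assumption, and the sample line is made up for demonstration.

```python
import json

def e2e_line_to_record(line):
    """Turn one 'context||completion' line into a JSONL-ready dict."""
    # Assumption: the first '||' separates context from completion.
    context, completion = line.rstrip("\n").split("||", 1)
    # Assumption: surrounding whitespace is not significant.
    return {"context": context.strip(), "completion": completion.strip()}

# Illustrative input line (not copied from the dataset).
sample = "name : The Mill | food : Italian||The Mill serves Italian food.\n"
print(json.dumps(e2e_line_to_record(sample)))
```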

DART Format Converter

python src/format_converting_dart.py <input_json> <output_jsonl>

Reads DART JSON (array of objects with tripleset and annotations). Triples are linearized as subject : relation : object joined by |. Each annotation produces a separate JSONL line.
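
A hedged sketch of this linearization is shown below. It assumes each triple is a three-element list and that each annotation stores its sentence under a `text` key; the function name `dart_example_to_records` and the sample values are hypothetical, and any normalization the real script may apply is omitted.

```python
import json

def dart_example_to_records(example):
    """Expand one DART object into one record per annotation."""
    # Assumption: each triple is a [subject, relation, object] list.
    context = " | ".join(
        "{} : {} : {}".format(subj, rel, obj)
        for subj, rel, obj in example["tripleset"]
    )
    # Assumption: the reference sentence lives under the "text" key.
    return [
        {"context": context, "completion": ann["text"]}
        for ann in example["annotations"]
    ]

# Illustrative object (placeholder values, not dataset content).
example = {
    "tripleset": [["A", "r1", "B"], ["A", "r2", "C"]],
    "annotations": [{"text": "First sentence."}, {"text": "Second sentence."}],
}
for record in dart_example_to_records(example):
    print(json.dumps(record))
```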

WebNLG Format Converter

python src/format_converting_webnlg.py <input_json> <output_jsonl>

Reads WebNLG JSON (dict with entries). Each entry has modifiedtripleset (list of {subject, property, object}) and lexicalisations. Triples are linearized as subject : property : object joined by |. Only lexicalisations with "comment": "good" are included. A boolean cate field is added indicating whether the category is among 10 seen categories.
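
The filtering and the cate flag can be illustrated with the sketch below, which works on a single entry. It rests on several assumptions: the entry exposes its category under a `category` key, the lexicalisation text is stored under `lex`, and the set of seen categories is passed in by the caller (the page does not list the 10 categories). The function name `webnlg_entry_to_records` is hypothetical.

```python
def webnlg_entry_to_records(entry, seen_categories):
    """Expand one WebNLG entry into records, keeping only 'good' lexicalisations."""
    context = " | ".join(
        "{} : {} : {}".format(t["subject"], t["property"], t["object"])
        for t in entry["modifiedtripleset"]
    )
    # Assumption: the category name is stored under "category".
    is_seen = entry["category"] in seen_categories
    return [
        # Assumption: the sentence text is stored under "lex".
        {"context": context, "completion": lex["lex"], "cate": is_seen}
        for lex in entry["lexicalisations"]
        if lex.get("comment") == "good"
    ]

# Illustrative entry with placeholder values.
entry = {
    "category": "Airport",
    "modifiedtripleset": [{"subject": "S", "property": "p", "object": "O"}],
    "lexicalisations": [
        {"comment": "good", "lex": "S p O."},
        {"comment": "bad", "lex": "skipped"},
    ],
}
records = webnlg_entry_to_records(entry, seen_categories={"Airport"})
print(records)
```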

BPE Encoding Script

python src/gpt2_encode.py --vocab <vocab_dir> --input <input_jsonl> --output <output_jsonl> [--add_bos] [--add_eos]

Arguments:

Argument	Type	Description
`--vocab`	str	Path to directory containing `encoder.json` and `vocab.bpe`
`--input`	str	Path to input JSONL (text context/completion)
`--output`	str	Path to output JSONL (BPE token IDs)
`--add_bos`	flag	Append BOS token (50256) to context
`--add_eos`	flag	Append EOS token (50256) to completion
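
For orientation, the arguments in the table could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the page does not show how the script actually parses its arguments, and marking the three paths as required is a guess.

```python
import argparse

def build_parser():
    """Declare the documented command-line arguments (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Encode context/completion JSONL into BPE token IDs"
    )
    # Assumption: the three path arguments are mandatory.
    parser.add_argument("--vocab", type=str, required=True,
                        help="directory containing encoder.json and vocab.bpe")
    parser.add_argument("--input", type=str, required=True,
                        help="input JSONL with text context/completion")
    parser.add_argument("--output", type=str, required=True,
                        help="output JSONL with BPE token IDs")
    parser.add_argument("--add_bos", action="store_true",
                        help="append token 50256 to the context")
    parser.add_argument("--add_eos", action="store_true",
                        help="append token 50256 to the completion")
    return parser

args = build_parser().parse_args(
    ["--vocab", "vocab", "--input", "in.jsonl", "--output", "out.jsonl", "--add_bos"]
)
print(args.add_bos, args.add_eos)
```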

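The encoding logic obtains an encoder through encoder.get_encoder and uses it to convert text to BPE token IDs. A leading space is prepended to the completion before encoding, and token 50256 is appended to the context or completion when the corresponding flag is set: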

enc = encoder.get_encoder(args.vocab)
context_bpes, _ = enc.encode(context)
context_bpes += [50256] if args.add_bos else []
completion_bpes, _ = enc.encode(' ' + completion)
completion_bpes += [50256] if args.add_eos else []
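
The effect of the two flags can be tried without the GPT-2 vocabulary files by using a stand-in encoder. In the sketch below, `ToyEncoder` is a hypothetical placeholder that maps characters to code points and is NOT the real BPE encoder; it only mirrors the two-value return of `enc.encode` seen in the excerpt. The `context`/`completion` keys of the returned record are likewise an assumption about the output layout.

```python
END_OF_TEXT = 50256  # ID appended for both --add_bos and --add_eos

class ToyEncoder:
    """Hypothetical stand-in for the real encoder; one ID per character."""
    def encode(self, text):
        tokens = list(text)
        ids = [ord(ch) for ch in tokens]
        return ids, tokens

def encode_pair(enc, context, completion, add_bos=False, add_eos=False):
    """Encode one context/completion pair, appending 50256 when requested."""
    context_bpes, _ = enc.encode(context)
    if add_bos:
        context_bpes = context_bpes + [END_OF_TEXT]
    # The completion is encoded with a leading space, as in the excerpt.
    completion_bpes, _ = enc.encode(" " + completion)
    if add_eos:
        completion_bpes = completion_bpes + [END_OF_TEXT]
    # Assumption: the output record reuses the "context"/"completion" keys.
    return {"context": context_bpes, "completion": completion_bpes}

record = encode_pair(ToyEncoder(), "ab", "c", add_bos=True, add_eos=True)
print(record)
```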

Orchestration Script

bash create_datasets.sh

This shell script runs both stages for all three datasets (E2E, WebNLG, DART) and each of their splits (train, valid, test). For example, the E2E train split is processed as follows:

python src/format_converting_e2e.py data/e2e/train.txt data/e2e/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input data/e2e/train_formatted.jsonl --output data/e2e/train.jsonl --add_bos --add_eos
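
The two commands above can be generated per split with a small helper, sketched here in Python. It assumes the valid and test splits follow the same file naming and use the same flags as the train example, which this page does not confirm; the function name `e2e_commands` is hypothetical.

```python
def e2e_commands(split, data_dir="data/e2e", vocab_dir="vocab"):
    """Build the two-stage command lines for one E2E split (not executed here)."""
    raw = "{}/{}.txt".format(data_dir, split)
    # Assumption: every split uses the '<split>_formatted.jsonl' intermediate name.
    formatted = "{}/{}_formatted.jsonl".format(data_dir, split)
    encoded = "{}/{}.jsonl".format(data_dir, split)
    return [
        ["python", "src/format_converting_e2e.py", raw, formatted],
        # Assumption: every split is encoded with both --add_bos and --add_eos.
        ["python", "src/gpt2_encode.py", "--vocab", vocab_dir,
         "--input", formatted, "--output", encoded, "--add_bos", "--add_eos"],
    ]

for split in ("train", "valid", "test"):
    for cmd in e2e_commands(split):
        print(" ".join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` from the `examples/NLG` directory to execute it.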

Input / Output

Direction	Description
Input	Raw dataset files. E2E: `data/e2e/{train,valid,test}.txt`; DART: `data/dart/dart-v1.1.1-full-{train,dev,test}.json`; WebNLG: `data/webnlg_challenge_2017/{train,dev,test}.json`
Output	BPE-encoded JSONL files with integer token ID arrays. E2E: `data/e2e/{train,valid,test}.jsonl`; DART: `data/dart/{train,valid,test}.jsonl`; WebNLG: `data/webnlg_challenge_2017/{train,valid,test}.jsonl`

Metadata

Field	Value
Source	microsoft/LoRA
Type	API Doc
Last Updated	2026-02-10
