Implementation:Microsoft LoRA GPT2 Encode Pipeline
Overview
GPT2_Encode_Pipeline is the concrete implementation of the two-stage dataset preparation process that converts raw NLG datasets (E2E, DART, WebNLG) into BPE-encoded JSONL files consumed by the GPT-2 LoRA fine-tuning pipeline. It consists of three format conversion scripts, one BPE encoding script, and an orchestrating shell script.
Type
API Doc
Source
examples/NLG/src/format_converting_e2e.py (lines 1-20)
examples/NLG/src/format_converting_dart.py (lines 1-43)
examples/NLG/src/format_converting_webnlg.py (lines 1-68)
examples/NLG/src/gpt2_encode.py (lines 1-70)
examples/NLG/create_datasets.sh (lines 1-44)
Signatures
Format Conversion Scripts
Each format converter takes a raw input file and produces a JSONL output file with {"context": str, "completion": str} on each line.
E2E Format Converter
python src/format_converting_e2e.py <input_txt> <output_jsonl>
Reads pipe-delimited text (context || completion) and writes JSONL. Each line is split on ||, with the first part as context and the second as completion.
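The split described above can be sketched as follows. This is a minimal illustration, not the script itself: the function names are hypothetical, and stripping whitespace around each part and splitting only on the first || are assumptions.

```python
import json


def convert_e2e_line(line: str) -> dict:
    """Split one 'context || completion' line into a record."""
    # Assumption: split on the first '||' only and trim surrounding whitespace.
    context, completion = line.strip().split("||", 1)
    return {"context": context.strip(), "completion": completion.strip()}


def convert_e2e(lines) -> list:
    """Return one JSON string per non-empty input line."""
    return [json.dumps(convert_e2e_line(line)) for line in lines if line.strip()]


sample = ["name : Blue Spice | type : coffee shop || Blue Spice is a coffee shop."]
for json_line in convert_e2e(sample):
    print(json_line)
```

Each returned string corresponds to one line of the output JSONL file.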
DART Format Converter
python src/format_converting_dart.py <input_json> <output_jsonl>
Reads DART JSON (array of objects with tripleset and annotations). Triples are linearized as subject : relation : object joined by |. Each annotation produces a separate JSONL line.
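A rough sketch of this linearization is shown below. The function names are hypothetical, the exact spacing around : and | is an assumption, and so is the "text" key used to read each annotation's reference sentence.

```python
def linearize(tripleset) -> str:
    """Join each [subject, relation, object] triple with ' : ', then join triples with ' | '."""
    return " | ".join(" : ".join(triple) for triple in tripleset)


def convert_dart(examples) -> list:
    """Emit one record per annotation, all sharing the example's linearized triples."""
    records = []
    for example in examples:
        context = linearize(example["tripleset"])
        for annotation in example["annotations"]:
            # Assumption: the reference sentence is stored under the "text" key.
            records.append({"context": context, "completion": annotation["text"]})
    return records


sample = [{
    "tripleset": [["Mars Hill College", "JOINED", "1973"],
                  ["Mars Hill College", "LOCATION", "Mars Hill, North Carolina"]],
    "annotations": [{"text": "Mars Hill College joined in 1973."},
                    {"text": "The college is in Mars Hill, North Carolina."}],
}]
for record in convert_dart(sample):
    print(record)
```

An example with two annotations yields two records with identical context values.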
WebNLG Format Converter
python src/format_converting_webnlg.py <input_json> <output_jsonl>
Reads WebNLG JSON (dict with entries). Each entry has modifiedtripleset (list of {subject, property, object}) and lexicalisations. Triples are linearized as subject : property : object joined by |. Only lexicalisations with "comment": "good" are included. A boolean cate field is added indicating whether the category is among 10 seen categories.
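The filtering and the cate flag might look roughly like the sketch below. It makes several assumptions that the description above does not confirm: each entry is treated as a flat dict with a "category" key, the lexicalisation text is read from a "lex" key, and the set of seen categories is passed in as a hypothetical seen_categories parameter rather than hard-coded.

```python
def convert_webnlg(data: dict, seen_categories: set) -> list:
    """Build one record per lexicalisation whose comment is 'good'."""
    records = []
    for entry in data["entries"]:
        # Linearize triples as 'subject : property : object' joined by ' | '.
        context = " | ".join(
            f'{t["subject"]} : {t["property"]} : {t["object"]}'
            for t in entry["modifiedtripleset"]
        )
        for lex in entry["lexicalisations"]:
            if lex["comment"] != "good":
                continue  # skip lexicalisations not marked as good
            records.append({
                "context": context,
                "completion": lex["lex"],  # assumption: text lives under "lex"
                "cate": entry["category"] in seen_categories,
            })
    return records


sample = {"entries": [{
    "category": "Airport",
    "modifiedtripleset": [{"subject": "Aarhus_Airport", "property": "cityServed", "object": "Aarhus"}],
    "lexicalisations": [{"comment": "good", "lex": "Aarhus Airport serves Aarhus."},
                        {"comment": "bad", "lex": "Airport Aarhus."}],
}]}
print(convert_webnlg(sample, seen_categories={"Airport"}))
```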
BPE Encoding Script
python src/gpt2_encode.py --vocab <vocab_dir> --input <input_jsonl> --output <output_jsonl> [--add_bos] [--add_eos]
Arguments:
| Argument | Type | Description |
|---|---|---|
| --vocab | str | Path to directory containing encoder.json and vocab.bpe |
| --input | str | Path to input JSONL (text context/completion) |
| --output | str | Path to output JSONL (BPE token IDs) |
| --add_bos | flag | Append BOS token (50256) to context |
| --add_eos | flag | Append EOS token (50256) to completion |
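The encoding logic obtains an encoder.Encoder instance through encoder.get_encoder and uses it to convert text to BPE token IDs. A single space is prepended to the completion before it is encoded: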
enc = encoder.get_encoder(args.vocab)
context_bpes, _ = enc.encode(context)
context_bpes += [50256] if args.add_bos else []
completion_bpes, _ = enc.encode(' ' + completion)
completion_bpes += [50256] if args.add_eos else []
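To see the effect of the two flags without the vocabulary files, the same steps can be run against a stand-in encoder. In the sketch below, ToyEncoder and encode_record are hypothetical names, the character-level encoding is not BPE, and the output keys context and completion are an assumption about the written record.

```python
EOT_ID = 50256  # token ID the snippet above appends for both BOS and EOS


class ToyEncoder:
    """Hypothetical stand-in for the real BPE encoder: one ID per character."""

    def encode(self, text):
        # Return an (ids, tokens) pair, mirroring the tuple unpacked above.
        return [ord(ch) for ch in text], list(text)


def encode_record(enc, record, add_bos=False, add_eos=False):
    """Encode one {"context", "completion"} record into token ID lists."""
    context_bpes, _ = enc.encode(record["context"])
    if add_bos:
        context_bpes = context_bpes + [EOT_ID]
    # The completion is encoded with a leading space, as in the snippet above.
    completion_bpes, _ = enc.encode(" " + record["completion"])
    if add_eos:
        completion_bpes = completion_bpes + [EOT_ID]
    return {"context": context_bpes, "completion": completion_bpes}


print(encode_record(ToyEncoder(), {"context": "ab", "completion": "c"},
                    add_bos=True, add_eos=True))
```

With both flags set, each list ends in 50256; with neither, the lists contain only the encoded text.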
Orchestration Script
bash create_datasets.sh
This shell script runs both stages for all three datasets (E2E, WebNLG, DART) across all splits (train, valid, test). For example, the E2E processing steps:
python src/format_converting_e2e.py data/e2e/train.txt data/e2e/train_formatted.jsonl
python src/gpt2_encode.py --vocab vocab --input data/e2e/train_formatted.jsonl --output data/e2e/train.jsonl --add_bos --add_eos
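The two-stage pattern for one split can also be expressed as a small helper that builds both command lines. This is only a sketch: build_commands is a hypothetical name, and applying the train file naming to the valid and test splits is an assumption, since only the train commands are shown above.

```python
def build_commands(dataset_dir: str, split: str, vocab: str = "vocab") -> list:
    """Return the format-conversion and BPE-encoding commands for one E2E split."""
    formatted = f"{dataset_dir}/{split}_formatted.jsonl"
    return [
        # Stage 1: raw text to context/completion JSONL.
        ["python", "src/format_converting_e2e.py",
         f"{dataset_dir}/{split}.txt", formatted],
        # Stage 2: JSONL text to BPE token IDs.
        ["python", "src/gpt2_encode.py", "--vocab", vocab,
         "--input", formatted,
         "--output", f"{dataset_dir}/{split}.jsonl",
         "--add_bos", "--add_eos"],
    ]


# Assumption: valid and test files follow the same naming as train.
for split in ("train", "valid", "test"):
    for command in build_commands("data/e2e", split):
        print(" ".join(command))
```

Each command is kept as an argument list so it could be passed to subprocess.run unchanged.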
Input / Output
| Direction | Description |
|---|---|
| Input | Raw dataset files: E2E pipe-delimited text, DART JSON, and WebNLG JSON |
| Output | BPE-encoded JSONL files with integer token ID arrays (for example, data/e2e/train.jsonl) |
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Type | API Doc |
| Last Updated | 2026-02-10 |