Principle:Microsoft LoRA NLG Dataset Preparation
Overview
NLG Dataset Preparation is the principle of converting raw NLG benchmark datasets into BPE-encoded JSONL format suitable for GPT-2 fine-tuning. This process transforms human-readable structured data (tables, triples, meaning representations) and their natural language references into sequences of integer token IDs that the GPT-2 model can consume during training and evaluation.
Description
The dataset preparation pipeline operates in two sequential stages:
Stage 1: Format Conversion (Raw to JSONL)
Each NLG benchmark has its own raw format. Format conversion scripts normalize these into a uniform JSONL schema where each line is a JSON object with two fields:
- context -- The structured input (meaning representation, triple set, or linearized table).
- completion -- The natural language reference text.
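The two-field schema can be sketched as a pair of small JSONL helpers. This is an illustration only: `write_jsonl` and `read_jsonl` are hypothetical names, not functions from the repository, and the completion text is a made-up reference.

```python
import io
import json

def write_jsonl(records, handle):
    """Write one JSON object per line (the JSONL convention)."""
    for record in records:
        handle.write(json.dumps(record) + "\n")

def read_jsonl(handle):
    """Read JSONL back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in handle if line.strip()]

# Illustrative record; the completion is invented for this sketch.
example = {
    "context": "name[The Eagle], eatType[coffee shop], food[French]",
    "completion": "The Eagle is a French coffee shop.",
}

buffer = io.StringIO()  # stands in for a file opened in text mode
write_jsonl([example], buffer)
buffer.seek(0)
records = read_jsonl(buffer)
```

Because every record occupies exactly one line, later stages can stream the file line by line instead of loading it whole.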
The three supported datasets and their raw formats are:
E2E (End-to-End)
- Raw format: Pipe-delimited text files (context || completion).
- Splits: train.txt, valid.txt, test.txt.
- Example context: name[The Eagle], eatType[coffee shop], food[French]
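A minimal sketch of the E2E conversion, assuming each raw line holds exactly one context || completion pair. The function name `e2e_line_to_record` is hypothetical, and the reference sentence is invented for illustration.

```python
import json

def e2e_line_to_record(line: str) -> dict:
    """Split one raw E2E line on the first '||' into the two-field schema."""
    context, sep, completion = line.strip().partition("||")
    if not sep:
        raise ValueError(f"no '||' delimiter in line: {line!r}")
    return {"context": context.strip(), "completion": completion.strip()}

# Illustrative raw line; the reference text is not taken from the dataset.
raw = ("name[The Eagle], eatType[coffee shop], food[French] || "
       "The Eagle is a French coffee shop.")
record = e2e_line_to_record(raw)
jsonl_line = json.dumps(record)  # one line of the normalized JSONL file
```

Splitting on the first delimiter only keeps any later pipe characters inside the reference text intact.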
DART
- Raw format: JSON array where each entry has a tripleset (list of [subject, relation, object] triples) and annotations (list of text references).
- Splits: dart-v1.1.1-full-train.json, dart-v1.1.1-full-dev.json, dart-v1.1.1-full-test.json.
- Linearization: Triples are joined as subject : relation : object | subject : relation : object.
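The DART linearization can be sketched as follows. Two assumptions are made here: each annotation stores its reference under a "text" key, and one record is emitted per annotation. The function names are hypothetical and the sample entry is toy data.

```python
def linearize_triples(tripleset):
    """Join [subject, relation, object] triples as 's : r : o | s : r : o'."""
    return " | ".join(
        " : ".join(part.strip() for part in triple) for triple in tripleset
    )

def dart_entry_to_records(entry):
    """Emit one context/completion record per annotation of a DART entry."""
    context = linearize_triples(entry["tripleset"])
    # Assumption: the reference text lives under the "text" key.
    return [
        {"context": context, "completion": annotation["text"]}
        for annotation in entry["annotations"]
    ]

# Toy entry for illustration only.
entry = {
    "tripleset": [
        ["The Eagle", "food", "French"],
        ["The Eagle", "eatType", "coffee shop"],
    ],
    "annotations": [
        {"text": "The Eagle is a French coffee shop."},
        {"text": "The Eagle coffee shop serves French food."},
    ],
}
dart_records = dart_entry_to_records(entry)
```

Every annotation of an entry shares the same linearized context, so an entry with several references yields several records.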
WebNLG
- Raw format: JSON dictionary with entries containing modifiedtripleset (list of {subject, property, object} dicts), lexicalisations, and category.
- Splits: train.json, dev.json, test.json.
- Category tracking: WebNLG tracks 10 seen categories (Airport, Astronaut, Building, City, ComicsCharacter, Food, Monument, SportsTeam, University, WrittenWork) and flags each example with a boolean cate field indicating whether the category was seen during training.
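A hedged sketch of the WebNLG conversion with category tracking. It assumes the entry has already been unwrapped into a plain dict, that each lexicalisation stores its text under a "lex" key, and that triples are linearized the same way as for DART; none of these details are confirmed by the text above. The function name and the sample entry are illustrative.

```python
SEEN_CATEGORIES = {
    "Airport", "Astronaut", "Building", "City", "ComicsCharacter",
    "Food", "Monument", "SportsTeam", "University", "WrittenWork",
}

def webnlg_entry_to_records(entry):
    """Emit one record per lexicalisation, flagged with the boolean cate field."""
    # Assumption: same ' : ' and ' | ' linearization as the DART converter.
    context = " | ".join(
        f"{t['subject']} : {t['property']} : {t['object']}"
        for t in entry["modifiedtripleset"]
    )
    seen = entry["category"] in SEEN_CATEGORIES
    # Assumption: the reference text lives under the "lex" key.
    return [
        {"context": context, "completion": lex["lex"], "cate": seen}
        for lex in entry["lexicalisations"]
    ]

# Toy entry for illustration only.
entry = {
    "category": "Airport",
    "modifiedtripleset": [
        {"subject": "Aarhus_Airport", "property": "cityServed", "object": "Aarhus"},
    ],
    "lexicalisations": [{"lex": "Aarhus Airport serves the city of Aarhus."}],
}
webnlg_records = webnlg_entry_to_records(entry)
unseen_records = webnlg_entry_to_records({**entry, "category": "Artist"})
```

Keeping the flag on every record lets evaluation report seen and unseen categories separately without re-reading the raw files.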
Stage 2: BPE Encoding (JSONL to Token IDs)
The second stage takes the normalized JSONL files and converts text strings into sequences of GPT-2 BPE token IDs. This uses the GPT-2 vocabulary files (encoder.json and vocab.bpe) to:
- Encode the context field into a list of integer token IDs.
- Encode the completion field into a list of integer token IDs, with a leading space prepended.
- Optionally prepend a BOS token (ID 50256) and append an EOS token (ID 50256) -- in GPT-2, the <|endoftext|> token serves as both BOS and EOS.
The output JSONL preserves the same two-field schema but with integer lists instead of strings:
{"context": [1234, 5678, ...], "completion": [50256, 9012, ..., 50256]}
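The encoding step can be sketched independently of any particular tokenizer by passing the encoder in as a callable. In this sketch `encode_record` is a hypothetical helper, `toy_encode` is a stand-in that maps characters to code points (it is not GPT-2 BPE), and the placement of the BOS token at the start of the completion simply mirrors the example record above; check the actual encoding script before relying on it.

```python
from typing import Callable, Dict, List

EOT_ID = 50256  # <|endoftext|>, used as both BOS and EOS in GPT-2

def encode_record(
    record: Dict[str, str],
    encode: Callable[[str], List[int]],
    add_bos: bool = False,
    add_eos: bool = False,
) -> Dict[str, List[int]]:
    """Turn a string record into a token-ID record with the same two fields."""
    context_ids = list(encode(record["context"]))
    # A leading space is prepended to the completion before encoding.
    completion_ids = list(encode(" " + record["completion"]))
    if add_bos:  # assumption: BOS placement follows the example record
        completion_ids = [EOT_ID] + completion_ids
    if add_eos:
        completion_ids = completion_ids + [EOT_ID]
    return {"context": context_ids, "completion": completion_ids}

def toy_encode(text: str) -> List[int]:
    """Stand-in encoder for the sketch: one ID per character, NOT real BPE."""
    return [ord(ch) for ch in text]

encoded = encode_record(
    {"context": "ab", "completion": "c"}, toy_encode, add_bos=True, add_eos=True
)
```

In practice `encode` would be the encode function of a GPT-2 BPE tokenizer loaded from encoder.json and vocab.bpe.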
Theoretical Basis
Byte Pair Encoding (BPE) is a subword tokenization algorithm that builds its vocabulary by iteratively merging the most frequent pair of adjacent symbols in a corpus. GPT-2 applies BPE at the byte level and uses a vocabulary of 50,257 tokens (256 byte tokens, 50,000 learned merges, and the <|endoftext|> token), which balances vocabulary size against sequence length. The BPE encoding step is critical because:
- It converts variable-length text into fixed-vocabulary integer sequences.
- Subword units handle rare and out-of-vocabulary words gracefully: an unseen word is split into known pieces rather than mapped to an unknown token.
- The same vocabulary must be used for encoding training data, generating predictions, and decoding outputs.
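To make the merge procedure concrete, here is a toy character-level sketch of BPE merge learning. It is a simplified illustration, not GPT-2's byte-level implementation; the function names and the tiny word-frequency table are assumptions made for the example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(word_counts, num_merges):
    """Learn up to `num_merges` merge rules from a word-frequency table."""
    words = {tuple(word): freq for word, freq in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Toy corpus: word -> frequency.
merges, segmented = learn_merges(
    {"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=2
)
```

On this toy corpus the frequent suffix "est" is assembled from single characters over two merges, which is how common substrings end up as single tokens while rare words stay split into smaller pieces.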
Metadata
| Field | Value |
|---|---|
| Source | microsoft/LoRA |
| Domains | Data Processing, NLG |
| Type | External Tool Doc |
| Last Updated | 2026-02-10 |