Principle:Microsoft LoRA NLG Dataset Preparation

Overview

NLG Dataset Preparation is the principle of converting raw NLG benchmark datasets into BPE-encoded JSONL format suitable for GPT-2 fine-tuning. This process transforms human-readable structured data (tables, triples, meaning representations) and their natural language references into sequences of integer token IDs that the GPT-2 model can consume during training and evaluation.

Description

The dataset preparation pipeline operates in two sequential stages:

Stage 1: Format Conversion (Raw to JSONL)

Each NLG benchmark has its own raw format. Format conversion scripts normalize these into a uniform JSONL schema where each line is a JSON object with two fields:

context -- The structured input (meaning representation, triple set, or linearized table).
completion -- The natural language reference text.

The three supported datasets and their raw formats are:

E2E (End-to-End)

Raw format: Pipe-delimited text files (context || completion).
Splits: train.txt, valid.txt, test.txt.
Example context: name[The Eagle], eatType[coffee shop], food[French]
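
A minimal sketch of the E2E conversion described above, assuming each raw line holds exactly one context and one completion separated by ||. The function name e2e_line_to_record and the sample reference sentence are illustrative and are not taken from the repository.

```python
import json

def e2e_line_to_record(line):
    """Split one raw 'context||completion' line into the two-field schema."""
    # Split on the first '||' only, in case the reference text contains pipes.
    context, completion = line.rstrip("\n").split("||", 1)
    return {"context": context.strip(), "completion": completion.strip()}

# Illustrative raw line; the reference sentence is made up for this example.
raw = ("name[The Eagle], eatType[coffee shop], food[French]"
       "||The Eagle is a coffee shop serving French food.")

record = e2e_line_to_record(raw)
jsonl_line = json.dumps(record)  # one JSON object per output line
```

Writing one such jsonl_line per raw line yields the normalized JSONL file for a split.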

DART

Raw format: JSON array where each entry has a tripleset (list of [subject, relation, object] triples) and annotations (list of text references).
Splits: dart-v1.1.1-full-train.json, dart-v1.1.1-full-dev.json, dart-v1.1.1-full-test.json.
Linearization: Triples are joined as subject : relation : object | subject : relation : object.
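
The linearization rule can be sketched as follows. This is an illustration under stated assumptions, not the repository's script: the helper names are hypothetical, one output record is emitted per reference, and an annotation is accepted either as a plain string or as a dict with a "text" key.

```python
def linearize_tripleset(tripleset):
    """Join [subject, relation, object] triples with ' : ' and ' | '."""
    return " | ".join(" : ".join(triple) for triple in tripleset)

def dart_entry_to_records(entry):
    """Emit one {context, completion} record per reference text."""
    context = linearize_tripleset(entry["tripleset"])
    records = []
    for annotation in entry["annotations"]:
        # Assumption: an annotation is a plain string or a dict with a "text" key.
        text = annotation["text"] if isinstance(annotation, dict) else annotation
        records.append({"context": context, "completion": text})
    return records

# Illustrative entry; the values are invented for this example.
entry = {
    "tripleset": [["The Eagle", "eatType", "coffee shop"],
                  ["The Eagle", "food", "French"]],
    "annotations": [{"text": "The Eagle is a French coffee shop."},
                    {"text": "The Eagle coffee shop serves French food."}],
}
records = dart_entry_to_records(entry)
```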

WebNLG

Raw format: JSON dictionary with entries containing modifiedtripleset (list of {subject, property, object} dicts), lexicalisations, and category.
Splits: train.json, dev.json, test.json.
Category tracking: WebNLG tracks 10 seen categories (Airport, Astronaut, Building, City, ComicsCharacter, Food, Monument, SportsTeam, University, WrittenWork) and flags each example with a boolean cate field indicating whether the category was seen during training.
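
A small sketch of the category flag and of building a context string from modifiedtripleset. The helper names are hypothetical, and reusing the DART-style " : " and " | " separators for WebNLG is an assumption, since the text above does not specify the WebNLG linearization.

```python
# The ten seen categories listed above.
SEEN_CATEGORIES = frozenset({
    "Airport", "Astronaut", "Building", "City", "ComicsCharacter",
    "Food", "Monument", "SportsTeam", "University", "WrittenWork",
})

def is_seen_category(category):
    """Value of the boolean cate field for one example."""
    return category in SEEN_CATEGORIES

def linearize_modified_tripleset(triples):
    """Assumed linearization: same separators as the DART rule."""
    return " | ".join(
        f"{t['subject']} : {t['property']} : {t['object']}" for t in triples
    )

# Illustrative entry; the values are chosen only for this example.
example = {
    "category": "Airport",
    "modifiedtripleset": [
        {"subject": "Aarhus_Airport", "property": "cityServed", "object": "Aarhus"},
    ],
}
context = linearize_modified_tripleset(example["modifiedtripleset"])
cate = is_seen_category(example["category"])
```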

Stage 2: BPE Encoding (JSONL to Token IDs)

The second stage takes the normalized JSONL files and converts text strings into sequences of GPT-2 BPE token IDs. This uses the GPT-2 vocabulary files (encoder.json and vocab.bpe) to:

Encode the context field into a list of integer token IDs.
Encode the completion field into a list of integer token IDs, with a leading space prepended.
Optionally prepend a BOS token (ID 50256) and append an EOS token (ID 50256) -- in GPT-2, the <|endoftext|> token serves as both BOS and EOS.

The output JSONL preserves the same two-field schema but with integer lists instead of strings:

{"context": [1234, 5678, ...], "completion": [50256, 9012, ..., 50256]}
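
The encoding steps above can be sketched as a single function. This is a hedged illustration: encode_record and toy_encode are hypothetical names, the placement of BOS and EOS around the completion follows the output layout shown above, and toy_encode is only a stand-in that maps characters to code points. A real run would use an encoder built from encoder.json and vocab.bpe instead.

```python
import json

END_OF_TEXT = 50256  # <|endoftext|>, used as both BOS and EOS in GPT-2

def encode_record(record, encode, add_bos=False, add_eos=False):
    """Turn a {context, completion} string record into token-ID lists."""
    context_ids = encode(record["context"])
    # The completion is encoded with a leading space prepended.
    completion_ids = encode(" " + record["completion"])
    if add_bos:
        completion_ids = [END_OF_TEXT] + completion_ids
    if add_eos:
        completion_ids = completion_ids + [END_OF_TEXT]
    return {"context": context_ids, "completion": completion_ids}

def toy_encode(text):
    """Stand-in encoder (NOT GPT-2 BPE): one integer per character."""
    return [ord(ch) for ch in text]

encoded = encode_record({"context": "a", "completion": "b"},
                        toy_encode, add_bos=True, add_eos=True)
jsonl_line = json.dumps(encoded)
```

Passing the encoder in as an argument keeps the wrapping logic independent of how the vocabulary files are loaded.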

Theoretical Basis

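Byte Pair Encoding (BPE) is a subword tokenization algorithm that builds its vocabulary by iteratively merging the most frequent pair of adjacent symbols in a corpus. GPT-2 applies BPE at the byte level and uses a vocabulary of 50,257 tokens (256 byte tokens, 50,000 learned merges, and the <|endoftext|> special token). A vocabulary of this size keeps the embedding table manageable while still encoding common words as one or a few tokens, which keeps sequences short. The BPE encoding step is critical because: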

It converts variable-length text into fixed-vocabulary integer sequences.
Subword units let rare and previously unseen words be represented as sequences of known tokens, so no unknown-token placeholder is needed.
The same vocabulary must be used for encoding training data, generating predictions, and decoding outputs.
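
To make the merge procedure concrete, here is a toy sketch of a single BPE training step. It works on characters for readability, whereas GPT-2 operates on bytes and ships a pre-learned merge table in vocab.bpe. The function names and the tiny corpus are illustrative.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of symbols) -> frequency.
corpus = {tuple("low"): 5, tuple("lot"): 3}
best = most_frequent_pair(corpus)    # ('l', 'o') occurs 8 times
corpus = merge_pair(corpus, best)    # 'l' + 'o' becomes the symbol 'lo'
```

Repeating these two calls until the target vocabulary size is reached produces the ordered merge list that an encoder later replays on new text.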

Metadata

Field	Value
Source	microsoft/LoRA
Domains	Data Processing, NLG
Type	External Tool Doc
Last Updated	2026-02-10
