

Principle:Microsoft LoRA NLG Dataset Preparation

From Leeroopedia


Overview

NLG Dataset Preparation is the principle of converting raw natural language generation (NLG) benchmark datasets into BPE-encoded JSONL files suitable for GPT-2 fine-tuning. The process transforms human-readable structured inputs (tables, triples, meaning representations) and their natural language references into sequences of integer token IDs that the GPT-2 model consumes during training and evaluation.

Description

The dataset preparation pipeline operates in two sequential stages:

Stage 1: Format Conversion (Raw to JSONL)

Each NLG benchmark has its own raw format. Format conversion scripts normalize these into a uniform JSONL schema where each line is a JSON object with two fields:

  • context -- The structured input (meaning representation, triple set, or linearized table).
  • completion -- The natural language reference text.

The three supported datasets and their raw formats are:

E2E (End-to-End)

  • Raw format: Text files with one example per line, where a double pipe separates the context from the completion (context || completion).
  • Splits: train.txt, valid.txt, test.txt.
  • Example context: name[The Eagle], eatType[coffee shop], food[French]
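The E2E conversion described above can be sketched as follows. This is a minimal illustration, not the repository's script: the helper names e2e_line_to_record and convert_e2e are hypothetical, and the completion sentence in the sample line is invented for the example.

```python
import json

def e2e_line_to_record(line):
    """Split one raw E2E line of the form 'context || completion'."""
    # Split on the first '||' only, so the completion text is kept whole.
    context, completion = line.strip().split("||", 1)
    return {"context": context.strip(), "completion": completion.strip()}

def convert_e2e(raw_lines):
    """Return one JSON string per non-empty raw line (one JSONL row each)."""
    return [json.dumps(e2e_line_to_record(l)) for l in raw_lines if l.strip()]

# Sample line: the context is taken from the text above, the completion is made up.
raw = [
    "name[The Eagle], eatType[coffee shop], food[French] || "
    "The Eagle is a French coffee shop.\n"
]
jsonl = convert_e2e(raw)
print(jsonl[0])
```

Writing each returned string on its own line of a file yields the uniform context/completion JSONL that Stage 2 expects.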

DART

  • Raw format: JSON array where each entry has a tripleset (list of [subject, relation, object] triples) and annotations (list of text references).
  • Splits: dart-v1.1.1-full-train.json, dart-v1.1.1-full-dev.json, dart-v1.1.1-full-test.json.
  • Linearization: Triples are joined as subject : relation : object | subject : relation : object.
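A minimal sketch of the DART linearization follows. It assumes that each annotation is a dict carrying its reference under a "text" key and that one JSONL record is emitted per annotation; both are assumptions for illustration, as are the function names and the sample entry.

```python
import json

def linearize_triples(tripleset):
    """Join triples as 'subject : relation : object | subject : relation : object'."""
    return " | ".join(" : ".join(triple) for triple in tripleset)

def dart_entry_to_records(entry):
    """Pair the linearized tripleset with every reference text of the entry."""
    context = linearize_triples(entry["tripleset"])
    # Assumption: each annotation is a dict with a "text" field.
    return [{"context": context, "completion": ann["text"]}
            for ann in entry["annotations"]]

# Invented sample entry, shaped like the raw format described above.
entry = {
    "tripleset": [["The Eagle", "food", "French"],
                  ["The Eagle", "eatType", "coffee shop"]],
    "annotations": [{"text": "The Eagle is a French coffee shop."}],
}
records = dart_entry_to_records(entry)
print(json.dumps(records[0]))
```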

WebNLG

  • Raw format: JSON dictionary with entries containing modifiedtripleset (list of {subject, property, object} dicts), lexicalisations, and category.
  • Splits: train.json, dev.json, test.json.
  • Category tracking: WebNLG tracks 10 seen categories (Airport, Astronaut, Building, City, ComicsCharacter, Food, Monument, SportsTeam, University, WrittenWork). Each converted example carries a boolean cate field, in addition to context and completion, indicating whether its category was seen during training.
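The category flag can be illustrated with a short sketch that works on a single WebNLG entry. Several details here are assumptions rather than facts from the source: the reference text is read from a "lex" key of each lexicalisation, the triples are linearized in the same "subject : property : object" style used for DART, and the function name and sample entries are hypothetical.

```python
SEEN_CATEGORIES = {
    "Airport", "Astronaut", "Building", "City", "ComicsCharacter",
    "Food", "Monument", "SportsTeam", "University", "WrittenWork",
}

def webnlg_entry_to_records(entry):
    """Convert one WebNLG entry into context/completion/cate records."""
    # Assumption: same ' : ' and ' | ' separators as the DART linearization.
    context = " | ".join(
        f"{t['subject']} : {t['property']} : {t['object']}"
        for t in entry["modifiedtripleset"]
    )
    cate = entry["category"] in SEEN_CATEGORIES  # True only for seen categories
    # Assumption: each lexicalisation stores its text under the "lex" key.
    return [{"context": context, "completion": lex["lex"], "cate": cate}
            for lex in entry["lexicalisations"]]

# Invented sample entries for illustration.
seen_entry = {
    "category": "Airport",
    "modifiedtripleset": [
        {"subject": "Aarhus_Airport", "property": "cityServed", "object": "Aarhus"}
    ],
    "lexicalisations": [{"lex": "Aarhus Airport serves the city of Aarhus."}],
}
unseen_entry = dict(seen_entry, category="Artist")

seen_records = webnlg_entry_to_records(seen_entry)
unseen_records = webnlg_entry_to_records(unseen_entry)
print(seen_records[0]["cate"], unseen_records[0]["cate"])
```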

Stage 2: BPE Encoding (JSONL to Token IDs)

The second stage takes the normalized JSONL files and converts text strings into sequences of GPT-2 BPE token IDs. This uses the GPT-2 vocabulary files (encoder.json and vocab.bpe) to:

  • Encode the context field into a list of integer token IDs.
  • Encode the completion field into a list of integer token IDs, with a leading space prepended.
  • Optionally prepend a BOS token (ID 50256) and append an EOS token (ID 50256) -- in GPT-2, the <|endoftext|> token serves as both BOS and EOS.

The output JSONL preserves the same two-field schema but with integer lists instead of strings:

{"context": [1234, 5678, ...], "completion": [50256, 9012, ..., 50256]}
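The encoding step can be sketched as below. The encode argument stands in for a real GPT-2 BPE encoder built from encoder.json and vocab.bpe; toy_encode is a placeholder that returns UTF-8 byte values, not GPT-2 token IDs. The function and flag names are hypothetical, and the BOS/EOS placement simply mirrors the example record above, so it should be checked against the actual encoding script before being relied on.

```python
import json

ENDOFTEXT_ID = 50256  # <|endoftext|>, used as both BOS and EOS in GPT-2

def encode_record(record, encode, add_bos=False, add_eos=False):
    """Turn a string-valued JSONL record into lists of token IDs."""
    context_ids = list(encode(record["context"]))
    # The completion is encoded with a leading space prepended.
    completion_ids = list(encode(" " + record["completion"]))
    if add_bos:
        completion_ids = [ENDOFTEXT_ID] + completion_ids
    if add_eos:
        completion_ids = completion_ids + [ENDOFTEXT_ID]
    return {"context": context_ids, "completion": completion_ids}

def toy_encode(text):
    """Placeholder encoder: UTF-8 byte values, NOT real GPT-2 BPE IDs."""
    return list(text.encode("utf-8"))

record = {"context": "name[The Eagle]", "completion": "The Eagle"}
encoded = encode_record(record, toy_encode, add_bos=True, add_eos=True)
print(json.dumps(encoded))
```

Passing the encoder in as a callable keeps the record-handling logic independent of how the vocabulary files are loaded.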

Theoretical Basis

Byte Pair Encoding (BPE) is a subword tokenization algorithm that builds its vocabulary by iteratively merging the most frequent pair of adjacent symbols in a corpus. GPT-2 applies BPE at the byte level and uses a vocabulary of 50,257 tokens (256 byte tokens, 50,000 learned merges, and the <|endoftext|> token), which balances vocabulary size against sequence length. The BPE encoding step is critical because:

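  • It converts arbitrary text into sequences of integer IDs drawn from a fixed vocabulary.
  • Subword units handle rare and previously unseen words gracefully: a word missing from the vocabulary is split into smaller known pieces, down to single bytes if necessary, rather than mapped to an unknown token.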
  • The same vocabulary must be used for encoding training data, generating predictions, and decoding outputs.
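To make the merge procedure concrete, here is a toy sketch of how BPE merges are learned. It operates on characters of a three-word corpus for readability, whereas GPT-2 works on bytes with a pre-trained merge table; the function names and the corpus are illustrative assumptions, not part of the GPT-2 tooling.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Return the most frequent adjacent symbol pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_merges(words, num_merges):
    """Learn up to `num_merges` merges from a list of words."""
    sequences = [list(w) for w in words]  # start from single characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(s, pair) for s in sequences]
    return merges, sequences

merges, sequences = learn_merges(["low", "lower", "lowest"], 2)
print(merges)
print(sequences)
```

After two merges the shared prefix "low" becomes a single symbol in all three words, which is the sense in which frequent substrings earn their own vocabulary entries.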

Metadata

Field         Value
Source        microsoft/LoRA
Domains       Data Processing, NLG
Type          External Tool Doc
Last Updated  2026-02-10
