Implementation:Microsoft LoRA DART Train Dataset

From Leeroopedia


Knowledge Sources
Domains: NLG, Data_to_Text, Benchmarking
Last Updated: 2026-02-10 06:00 GMT

Overview

Training split of the DART (Data-Record to Text) benchmark dataset. It is the largest of the three DART splits and supplies the structured tripleset-to-text examples used to train data-to-text generation models.

Description

The DART training set is a JSON file of 968,029 lines containing structured tripleset records paired with human-written text annotations. It is the largest of the three DART splits and serves as the primary training corpus for LoRA-adapted GPT-2 models. Each example pairs a set of subject-relation-object triples with natural language descriptions. The examples come from diverse sources (WikiSQL declarative sentences, WikiTableQuestions crowd annotations, and Mechanical Turk annotations), and each annotation records its provenance in a source field with values such as WikiTableQuestions_mturk, WikiTableQuestions_lily, and WikiSQL_decl_sents. Relation names are uppercase (e.g., LOCATION, CITY_OR_TOWN, SURFACE, MARKER_NAME), and triples may include special context markers such as [TABLECONTEXT] and [TITLE]. The subtree_was_extended boolean tracks whether the original data subtree was augmented during dataset construction.

Usage

Use this dataset as the training split for fine-tuning GPT-2 with LoRA on data-to-text generation. The format_converting_dart.py script first converts this JSON into line-delimited context/completion pairs, which are then encoded with gpt2_encode.py and fed to gpt2_ft.py for parameter-efficient fine-tuning. The training process updates only low-rank adapter weights while keeping GPT-2 base parameters frozen.
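To make the parameter-efficiency claim concrete, the following minimal sketch counts frozen versus trainable parameters for a single adapted weight matrix. It assumes the standard LoRA factorisation (a rank-r pair of matrices added to a frozen weight) and a hypothetical 1024 x 1024 projection. The helper name lora_param_counts is illustrative and not part of the repository; the rank and alpha values match the --lora_dim and --lora_alpha settings used on this page.

```python
from typing import Tuple


def lora_param_counts(d_in: int, d_out: int, rank: int) -> Tuple[int, int]:
    """Return (frozen, trainable) parameter counts for one LoRA-adapted matrix."""
    frozen = d_in * d_out              # base weight W, never updated
    trainable = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return frozen, trainable


# Assumption: a square 1024 x 1024 projection, chosen only for illustration.
frozen, trainable = lora_param_counts(1024, 1024, rank=4)

# The low-rank update is scaled by lora_alpha / lora_dim.
scaling = 32 / 4

print(frozen, trainable, scaling)  # prints: 1048576 8192 8.0
```

Under these assumptions the adapter trains fewer than 1% of the parameters of the matrix it adapts.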

Code Reference

Source Location

Data Schema

[
  {
    "tripleset": [
      ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
      ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"]
    ],
    "subtree_was_extended": false,
    "annotations": [
      {
        "source": "WikiTableQuestions_mturk",
        "text": "First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville"
      }
    ]
  }
]
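The schema above can be checked mechanically before training. The sketch below is one possible validator, not part of the repository: the function name validate_example is hypothetical, and the requirement that tripleset and annotations be non-empty is an assumption of this sketch rather than a documented guarantee of the dataset.

```python
from typing import Any, List


def validate_example(example: Any) -> List[str]:
    """Return a list of schema problems; an empty list means no problem was found."""
    if not isinstance(example, dict):
        return ["example is not a JSON object"]

    problems = []

    triples = example.get("tripleset")
    if not isinstance(triples, list) or not triples:
        problems.append("tripleset must be a non-empty list")  # assumption: never empty
    else:
        for i, triple in enumerate(triples):
            ok = (isinstance(triple, list) and len(triple) == 3
                  and all(isinstance(part, str) for part in triple))
            if not ok:
                problems.append(f"triple {i} is not a list of three strings")

    if not isinstance(example.get("subtree_was_extended"), bool):
        problems.append("subtree_was_extended must be a boolean")

    annotations = example.get("annotations")
    if not isinstance(annotations, list) or not annotations:
        problems.append("annotations must be a non-empty list")  # assumption: never empty
    else:
        for i, ann in enumerate(annotations):
            ok = (isinstance(ann, dict)
                  and isinstance(ann.get("source"), str)
                  and isinstance(ann.get("text"), str))
            if not ok:
                problems.append(f"annotation {i} needs string 'source' and 'text' keys")

    return problems


sample = {
    "tripleset": [["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"]],
    "subtree_was_extended": False,
    "annotations": [{"source": "WikiTableQuestions_mturk", "text": "First Clearing ..."}],
}
print(validate_example(sample))  # prints: []
```

Returning a list of messages instead of raising on the first error makes it easy to scan the whole file and report every malformed record at once.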

Format Conversion

The format_converting_dart.py script transforms each example into a context/completion pair by joining triples with " | " separators and using " : " between subject, relation, and object. Relations are lowercased during conversion:

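# Excerpt from format_converting_dart.py; `example` is one record of the loaded JSON array.
temp_triples = ''  # accumulator for the flattened triples, reset for each example
for i, tripleset in enumerate(example['tripleset']):
    subj, rela, obj = tripleset
    rela = rela.lower()  # relations are lowercased
    if i > 0:
        temp_triples += ' | '  # separator between consecutive triples
    temp_triples += '{} : {} : {}'.format(subj, rela, obj)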
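As a self-contained illustration of the conversion described above, the sketch below applies the same joining rules to the sample record from the Data Schema section. The function name to_pairs is hypothetical, and two details are assumptions of this sketch rather than facts stated on this page: that one pair is emitted per annotation, and that the output keys are named context and completion.

```python
import json


def to_pairs(example):
    """Flatten one DART example into context/completion dictionaries."""
    # Join subject, lowercased relation, and object with " : ", and triples with " | ".
    context = " | ".join(
        "{} : {} : {}".format(subj, rela.lower(), obj)
        for subj, rela, obj in example["tripleset"]
    )
    # Assumption: one output pair per annotation, all sharing the same context.
    return [{"context": context, "completion": ann["text"]}
            for ann in example["annotations"]]


example = {
    "tripleset": [
        ["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
        ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"],
    ],
    "subtree_was_extended": False,
    "annotations": [
        {
            "source": "WikiTableQuestions_mturk",
            "text": "First Clearing\tbased on Callicoon, New York and location at On NYS 52 1 Mi. Youngsville",
        }
    ],
}

pairs = to_pairs(example)
for pair in pairs:
    print(json.dumps(pair))  # one JSON object per line, as in a .jsonl file
```

For this record the context string is "First Clearing : location : On NYS 52 1 Mi. Youngsville | On NYS 52 1 Mi. Youngsville : city_or_town : Callicoon, New York".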

Loading

import json

with open("examples/NLG/data/dart/dart-v1.1.1-full-train.json", "r") as f:
    data = json.load(f)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| file_path | str | Yes | Path to the JSON file (e.g., examples/NLG/data/dart/dart-v1.1.1-full-train.json) |

Outputs

| Name | Type | Description |
|---|---|---|
| data | List[Dict] | Top-level JSON array of examples, each containing tripleset, subtree_was_extended, and annotations |
| tripleset | List[List[str]] | List of subject-relation-object triples, each a 3-element string list |
| subtree_was_extended | bool | Whether the data subtree was augmented during dataset construction |
| annotations | List[Dict] | Human-written text descriptions with source attribution (keys: source, text) |

Usage Examples

Loading and Inspecting DART Training Data

import json

# Load the training set
with open("examples/NLG/data/dart/dart-v1.1.1-full-train.json", "r") as f:
    train_data = json.load(f)

# Inspect first example
example = train_data[0]
print(f"Triples: {example['tripleset']}")
print(f"Extended: {example['subtree_was_extended']}")
print(f"Source: {example['annotations'][0]['source']}")
print(f"Text: {example['annotations'][0]['text']}")
print(f"Total training examples: {len(train_data)}")

# Examine annotation source distribution
from collections import Counter
sources = Counter(ann['source'] for ex in train_data for ann in ex['annotations'])
for source, count in sources.most_common():
    print(f"  {source}: {count}")
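Beyond annotation sources, it can be useful to look at which relations occur and how many triples each example carries. The following sketch shows one way to compute both with collections.Counter. The helper names relation_counts and triples_per_example are hypothetical, and the two-record list stands in for the loaded training data so that the snippet runs on its own.

```python
from collections import Counter


def relation_counts(examples):
    """Count how often each relation name appears across all triples."""
    return Counter(rela for ex in examples for _, rela, _ in ex["tripleset"])


def triples_per_example(examples):
    """Map tripleset size -> number of examples with that many triples."""
    return Counter(len(ex["tripleset"]) for ex in examples)


# Stand-in data; replace with the list returned by json.load on the training file.
examples = [
    {"tripleset": [["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"],
                   ["On NYS 52 1 Mi. Youngsville", "CITY_OR_TOWN", "Callicoon, New York"]]},
    {"tripleset": [["First Clearing", "LOCATION", "On NYS 52 1 Mi. Youngsville"]]},
]

rel_counts = relation_counts(examples)
size_counts = triples_per_example(examples)

for rela, count in rel_counts.most_common():
    print(f"  {rela}: {count}")
for size, count in sorted(size_counts.items()):
    print(f"  {size} triple(s): {count} example(s)")
```

The tripleset size distribution gives a rough idea of how long the converted contexts will be before tokenization.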

Full Training Pipeline

# Step 1: Convert training JSON to line-delimited format
python examples/NLG/src/format_converting_dart.py \
    examples/NLG/data/dart/dart-v1.1.1-full-train.json \
    examples/NLG/data/dart/dart-v1.1.1-full-train.jsonl

# Step 2: Encode for GPT-2 (gpt2_encode.py takes --input and --output)
python examples/NLG/src/gpt2_encode.py \
    --vocab examples/NLG/vocab \
    --input examples/NLG/data/dart/dart-v1.1.1-full-train.jsonl \
    --output examples/NLG/data/dart/dart-v1.1.1-full-train.encoded \
    --add_bos \
    --add_eos

# Step 3: Fine-tune GPT-2 with LoRA
# The dev file must first be produced by running Steps 1-2 on the dev split.
# Other gpt2_ft.py arguments (checkpoint, batch size, learning rate, ...) are omitted here.
python examples/NLG/src/gpt2_ft.py \
    --train_data examples/NLG/data/dart/dart-v1.1.1-full-train.encoded \
    --valid_data examples/NLG/data/dart/dart-v1.1.1-full-dev.encoded \
    --lora_dim 4 \
    --lora_alpha 32
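Before encoding, the output of the conversion step can be sanity-checked against the source JSON. The sketch below is one possible check, not part of the repository. It assumes the converter writes one line per annotation and that each line is a JSON object with context and completion keys; the function name check_converted is hypothetical, and in-memory data is used so the snippet runs without the dataset files.

```python
import json


def check_converted(jsonl_lines, examples):
    """Compare converted lines against the source examples and report mismatches."""
    # Assumption: the converter emits one line per annotation.
    expected = sum(len(ex["annotations"]) for ex in examples)
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    # Assumption: each record carries 'context' and 'completion' keys.
    bad = [i for i, rec in enumerate(records)
           if not {"context", "completion"} <= set(rec)]
    return {"expected": expected, "found": len(records), "bad_records": bad}


# Stand-in data; in practice pass an open file handle for the .jsonl file
# and the list loaded from the source .json file.
examples = [
    {"tripleset": [["A", "REL", "B"]],
     "annotations": [{"source": "s", "text": "A rel B."},
                     {"source": "s", "text": "B is the rel of A."}]},
]
jsonl_lines = [
    '{"context": "A : rel : B", "completion": "A rel B."}\n',
    '{"context": "A : rel : B", "completion": "B is the rel of A."}\n',
]

report = check_converted(jsonl_lines, examples)
print(report)  # prints: {'expected': 2, 'found': 2, 'bad_records': []}
```

A mismatch between expected and found usually points to a truncated conversion run or to a converter that groups annotations differently from what this sketch assumes.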
