Workflow:Microsoft LoRA GPT2 NLG Finetuning

Knowledge Sources	Microsoft LoRA; LoRA: Low-Rank Adaptation of Large Language Models; HuggingFace GPT-2
Domains	LLMs, Natural_Language_Generation, Fine_Tuning, Parameter_Efficient_Fine_Tuning
Last Updated	2026-02-10 05:30 GMT

Overview

End-to-end process for fine-tuning GPT-2 (Medium or Large) with LoRA on natural language generation benchmarks (E2E, DART, WebNLG), including data preparation, training, beam search decoding, and evaluation.

Description

This workflow covers the complete pipeline for adapting GPT-2 models to data-to-text generation tasks using Low-Rank Adaptation. It starts from raw dataset files (restaurant descriptions, structured data triples, or RDF triples), converts them into the BPE-encoded format expected by GPT-2, fine-tunes only the LoRA adapter parameters in the attention layers, generates outputs via beam search, decodes token IDs back to text, and evaluates using standard NLG metrics (BLEU, METEOR, ROUGE-L, TER). The approach trains only 0.35M (Medium) to 0.77M (Large) parameters, compared to 354M to 774M for full fine-tuning, while achieving performance reported as comparable or superior to full fine-tuning.

Usage

Execute this workflow when you have a structured data-to-text dataset (e.g., key-value pairs describing a restaurant, RDF triples, or structured data records) and need to train GPT-2 to generate natural language descriptions from these structured inputs. This is the reference implementation for reproducing the LoRA paper results on NLG benchmarks.

Execution Steps

Step 1: Environment Setup

Set up the runtime environment with GPU support and install all required dependencies. Download the pretrained GPT-2 checkpoint and evaluation scripts.

Key considerations:

Recommended base image: nvcr.io/nvidia/pytorch:20.03-py3
Install system dependencies (git, jq, virtualenv) and Python packages from requirement.txt
Download the pretrained GPT-2 checkpoint (Medium, commonly labeled 345M params and counted as 354M in the Description, or Large, 774M params)
Download external evaluation scripts for BLEU/METEOR/TER computation
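
Before installing anything, it can help to confirm that the system tools named above are on the PATH. The following is a minimal sketch using only the Python standard library; the helper name missing_tools is hypothetical and is not part of the LoRA repository.

```python
import shutil


def missing_tools(tools):
    """Return the command names from `tools` that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Tool names taken from the system dependencies listed in this step.
    missing = missing_tools(["git", "jq", "virtualenv"])
    if missing:
        print("Missing system tools:", ", ".join(missing))
    else:
        print("All listed system tools were found on PATH.")
```

This check only covers the command-line tools; the Python packages from requirement.txt and GPU availability still need to be verified separately.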

Step 2: Dataset Preparation

Convert raw dataset files into the JSONL format expected by the training pipeline, then BPE-encode the text into token IDs. Each dataset (E2E, DART, WebNLG) has a specific format converter.

Key considerations:

E2E format: tab-separated key-value pairs with natural language references
DART format: JSON with structured data triples and annotated references
WebNLG format: JSON with RDF triples and multiple reference texts
The create_datasets.sh script orchestrates all format conversions
BPE encoding uses the GPT-2 vocabulary (encoder.json + vocab.bpe) to produce token ID sequences
Train/valid/test splits are maintained throughout the pipeline
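
The raw-to-JSONL conversion can be sketched as follows. This is an illustration under stated assumptions, not the converter shipped with the workflow: it assumes each raw line holds the structured input and one reference separated by a tab, and the field names context and completion, the pair separators, and all helper names are hypothetical. The sample sentence is made up.

```python
import json


def parse_key_values(structured, pair_sep="|", kv_sep=":"):
    """Split 'name : X | food : Y' into an ordered list of (key, value) pairs."""
    pairs = []
    for chunk in structured.split(pair_sep):
        if kv_sep not in chunk:
            continue  # skip empty or malformed chunks
        key, value = chunk.split(kv_sep, 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


def raw_line_to_record(line):
    """Convert one tab-separated raw line into a JSONL-ready dict."""
    structured, reference = line.rstrip("\n").split("\t", 1)
    pairs = parse_key_values(structured)
    context = " | ".join(f"{key} : {value}" for key, value in pairs)
    return {"context": context, "completion": reference.strip()}


def convert(lines):
    """Yield one JSON string per non-empty raw line."""
    for line in lines:
        if line.strip():
            yield json.dumps(raw_line_to_record(line))


if __name__ == "__main__":
    # Made-up sample line, for illustration only.
    sample = ["name : The Mill | food : Italian\tThe Mill serves Italian food.\n"]
    for row in convert(sample):
        print(row)
```

BPE encoding would then run over the context and completion strings of each record; that step needs the real encoder.json and vocab.bpe files and is not reproduced here.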

Step 3: Model Configuration with LoRA

Initialize the GPT-2 model with LoRA adapters injected into the multi-head attention layers. The custom GPT-2 implementation uses loralib.MergedLinear for the combined attention QKV projection. It adds low-rank updates only to the query and value sub-projections; the key sub-projection gets no adapter, and all pretrained weights stay frozen.

Key considerations:

LoRA is applied via MergedLinear with enable_lora=[True, False, True] for the c_attn projection (Q, K, V)
The fan_in_fan_out flag is set to True to handle GPT-2 Conv1D weight layout
Key hyperparameters: lora_dim (rank r), lora_alpha (scaling factor), lora_dropout
Paper defaults: r=4, alpha=32, dropout=0.1 for GPT-2 Medium on E2E
The pretrained checkpoint is loaded with strict=False, then lora.mark_only_lora_as_trainable freezes non-LoRA params
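
The effect of enable_lora=[True, False, True] can be illustrated with plain Python lists. This is a conceptual sketch of the update W + (alpha / r) * B A applied per sub-projection, not the loralib implementation: it ignores the fan_in_fan_out transposition and dropout, and the function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_delta(lora_a, lora_b, alpha):
    """Return the scaled low-rank update (alpha / r) * B @ A."""
    rank = len(lora_a)  # A has shape (r, d_in), so its row count is the rank
    scale = alpha / rank
    return [[scale * value for value in row] for row in matmul(lora_b, lora_a)]


def merged_qkv_weight(weights, adapters, alpha, enable_lora=(True, False, True)):
    """Stack the Q, K, V weight matrices, adding a LoRA delta only where enabled.

    weights:  three (d_out, d_in) matrices for Q, K, V.
    adapters: three entries, each an (A, B) pair or None when disabled.
    """
    merged = []
    for weight, adapter, enabled in zip(weights, adapters, enable_lora):
        if enabled:
            lora_a, lora_b = adapter
            delta = lora_delta(lora_a, lora_b, alpha)
            weight = [
                [w + d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(weight, delta)
            ]
        merged.extend(weight)  # K passes through unchanged
    return merged


if __name__ == "__main__":
    eye = [[1.0, 0.0], [0.0, 1.0]]
    lora_a = [[1.0, 2.0]]        # rank 1, d_in = 2
    lora_b = [[0.5], [0.0]]      # d_out = 2, rank 1
    zero_b = [[0.0], [0.0]]      # B starts at zero, so the delta is zero
    stacked = merged_qkv_weight(
        [eye, eye, eye], [(lora_a, lora_b), None, (lora_a, zero_b)], alpha=2.0
    )
    for row in stacked:
        print(row)
```

With B initialized to zero, as in the value adapter of the demo, the adapted weight equals the pretrained weight, which is why training starts from the behaviour of the loaded checkpoint.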

Step 4: Distributed Training

Fine-tune the model using the PyTorch distributed training framework with an AdamW optimizer and linear learning rate schedule. The training loop processes encoded input-target pairs with optional label smoothing and gradient accumulation.

Key considerations:

Uses torch.distributed.launch for multi-GPU training
AdamW optimizer with configurable learning rate (0.0002 for E2E), weight decay (0.01), and beta2 (0.999)
Linear learning rate scheduler with warmup (500 steps for E2E)
Checkpoints are saved at regular intervals (every 1000 steps)
Training runs for 5 epochs with gradient accumulation support
Loss is computed using cross-entropy with optional label smoothing (0.1)
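
The schedule and loss described above can be sketched in a few lines. Both functions are common textbook formulations written for illustration; that the schedule decays linearly to zero and that smoothing mass is spread uniformly over the vocabulary are assumptions, and the total step count in the demo is a hypothetical value.

```python
import math


def linear_warmup_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def label_smoothed_nll(log_probs, target, eps):
    """Cross-entropy against a smoothed target distribution.

    The gold token keeps weight (1 - eps); the remaining eps is spread
    uniformly over every vocabulary entry.
    """
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - eps) * nll + eps * uniform


if __name__ == "__main__":
    # lr=0.0002 and warmup=500 follow the E2E settings listed above;
    # total_steps=10000 is a hypothetical value for the demo.
    for step in (0, 250, 500, 5000, 10000):
        print(step, linear_warmup_lr(step, 2e-4, 500, 10000))

    log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
    print(label_smoothed_nll(log_probs, target=0, eps=0.1))
```

Setting eps to 0 recovers the ordinary negative log-likelihood of the gold token.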

Step 5: Beam Search Generation

Generate text outputs from the fine-tuned model using beam search decoding on the test set. The beam search implementation supports length penalty, repetition penalty, and n-gram blocking.

Key considerations:

Beam width of 10 is used for evaluation
Length penalty (0.8) adjusts beam scores by output length so that hypotheses of different lengths can be ranked against each other
No-repeat n-gram size (4) prevents repetitive phrases
The EOS token ID (628 in this workflow, which differs from GPT-2's own end-of-text token ID of 50256) controls generation termination
Outputs are saved as JSONL files containing token IDs
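
Two of the beam search controls above can be sketched as small helper functions. These are illustrative only and are not taken from the workflow's beam search code; in particular, dividing the summed log-probability by length raised to the length penalty is one widely used convention and is an assumption here.

```python
def banned_next_tokens(tokens, ngram_size):
    """Return tokens that would repeat an n-gram already present in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size:
        return set()
    # The last (n - 1) tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[start:start + ngram_size - 1]) == prefix:
            banned.add(tokens[start + ngram_size - 1])
    return banned


def length_normalized_score(sum_log_prob, length, length_penalty):
    """Score a hypothesis as sum_log_prob / length ** length_penalty."""
    return sum_log_prob / (length ** length_penalty)


def is_finished(tokens, eos_token_id):
    """A hypothesis is finished once its last token is the EOS ID."""
    return bool(tokens) and tokens[-1] == eos_token_id


if __name__ == "__main__":
    history = [1, 2, 3, 4, 1, 2, 3]
    print(banned_next_tokens(history, ngram_size=4))
    print(length_normalized_score(-8.0, length=16, length_penalty=0.8))
    print(is_finished([5, 9, 628], eos_token_id=628))
```

During decoding, the log-probabilities of banned tokens would be set to negative infinity before the next beam expansion, so those continuations can never be selected.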

Step 6: Text Decoding

Convert beam search output token IDs back to human-readable text using the BPE decoder. Also formats reference texts from the original test set for evaluation comparison.

Key considerations:

Uses the GPT-2 BPE vocabulary for detokenization
Generates both prediction and reference text files
For WebNLG and DART, multiple references per input are handled (up to 6)
Optional tokenization and lowercasing for evaluation consistency
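
The decoding and reference-grouping steps can be sketched with a toy vocabulary. This is a simplification: the real GPT-2 decoder maps tokens back through a byte-level table, whereas this sketch only handles the "Ġ" marker (U+0120) that GPT-2 uses for a leading space. The token IDs, vocabulary, and function names below are hypothetical.

```python
def decode_tokens(token_ids, id_to_token, eos_token_id=None, lowercase=False):
    """Join token strings into text, stopping at the first EOS token."""
    pieces = []
    for token_id in token_ids:
        if eos_token_id is not None and token_id == eos_token_id:
            break  # everything after EOS is discarded
        pieces.append(id_to_token[token_id])
    text = "".join(pieces).replace("\u0120", " ").strip()
    return text.lower() if lowercase else text


def group_references(pairs, max_refs=6):
    """Group (context, reference) pairs by context, keeping at most max_refs each."""
    grouped = {}
    for context, reference in pairs:
        refs = grouped.setdefault(context, [])
        if len(refs) < max_refs:
            refs.append(reference)
    return grouped


if __name__ == "__main__":
    # Toy vocabulary with made-up IDs; not the real GPT-2 vocabulary.
    toy_vocab = {0: "The", 1: "\u0120Mill", 2: "\u0120serves", 3: "\u0120food", 4: ".", 5: "\n\n"}
    print(decode_tokens([0, 1, 2, 3, 4, 5, 0], toy_vocab, eos_token_id=5))
    print(group_references([("ctx A", "ref 1"), ("ctx A", "ref 2"), ("ctx B", "ref 3")]))
```

Grouping references by input keeps predictions and reference sets aligned line by line, which multi-reference metrics depend on.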

Step 7: Evaluation

Compute standard NLG metrics by comparing generated text against reference texts. Different evaluation scripts are used depending on the dataset.

Key considerations:

E2E evaluation uses measure_scores.py computing BLEU, NIST, METEOR, ROUGE-L, CIDEr
WebNLG and DART evaluation uses GenerationEval/eval.py computing BLEU, METEOR, TER
Multiple references are supported for more robust evaluation
Results are reported with confidence intervals across multiple random seeds
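
To make the role of multiple references concrete, the clipped n-gram precision at the core of BLEU can be sketched as below. This is an illustration only and does not replace measure_scores.py or GenerationEval/eval.py; it omits the brevity penalty, the combination across n-gram orders, and any tokenization those scripts apply.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of `candidate` against several references.

    Each candidate n-gram count is clipped to the maximum count of that
    n-gram in any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


if __name__ == "__main__":
    candidate = "the mill serves italian food".split()
    references = [
        "the mill serves italian food".split(),
        "italian food is served at the mill".split(),
    ]
    for n in (1, 2):
        print(n, modified_precision(candidate, references, n))
```

Because counts are clipped against the best-matching reference, adding more valid references can only keep or raise the precision of a correct output, which is why multi-reference evaluation is more robust.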

GitHub URL

Workflow Repository