Workflow:Microsoft LoRA GPT2 NLG Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Natural_Language_Generation, Fine_Tuning, Parameter_Efficient_Fine_Tuning |
| Last Updated | 2026-02-10 05:30 GMT |
Overview
End-to-end process for fine-tuning GPT-2 (Medium or Large) with LoRA on natural language generation benchmarks (E2E, DART, WebNLG), including data preparation, training, beam search decoding, and evaluation.
Description
This workflow covers the complete pipeline for adapting GPT-2 models to data-to-text generation tasks using Low-Rank Adaptation (LoRA). It starts from raw dataset files (restaurant descriptions, structured data triples, or RDF triples), converts them into the BPE-encoded format expected by GPT-2, fine-tunes only the LoRA adapter parameters in the attention layers, generates outputs via beam search, decodes token IDs back to text, and evaluates with standard NLG metrics (BLEU, NIST, METEOR, ROUGE-L, and CIDEr for E2E; BLEU, METEOR, and TER for WebNLG and DART). The approach trains only 0.35M to 0.77M parameters, roughly 0.1% of the 354M to 774M updated by full fine-tuning, while the LoRA paper reports comparable or better scores.
Usage
Execute this workflow when you have a structured data-to-text dataset (e.g., key-value pairs describing a restaurant, RDF triples, or structured data records) and need to train GPT-2 to generate natural language descriptions from these structured inputs. This is the reference implementation for reproducing the LoRA paper results on NLG benchmarks.
Execution Steps
Step 1: Environment Setup
Set up the runtime environment with GPU support and install all required dependencies. Download the pretrained GPT-2 checkpoint and evaluation scripts.
Key considerations:
- Recommended base image: nvcr.io/nvidia/pytorch:20.03-py3
- Install system dependencies (git, jq, virtualenv) and Python packages from requirement.txt
- Download the pretrained GPT-2 checkpoint (Medium, about 355M parameters and often labeled 345M; or Large, 774M parameters)
- Download external evaluation scripts for BLEU/METEOR/TER computation
Step 2: Dataset Preparation
Convert raw dataset files into the JSONL format expected by the training pipeline, then BPE-encode the text into token IDs. Each dataset (E2E, DART, WebNLG) has a specific format converter.
Key considerations:
- E2E format: tab-separated key-value pairs with natural language references
- DART format: JSON with structured data triples and annotated references
- WebNLG format: JSON with RDF triples and multiple reference texts
- The create_datasets.sh script orchestrates all format conversions
- BPE encoding uses the GPT-2 vocabulary (encoder.json + vocab.bpe) to produce token ID sequences
- Train/valid/test splits are maintained throughout the pipeline
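As a rough illustration of the conversion step, the sketch below turns one E2E-style line into a JSONL record. The tab-separated input layout follows the description above; the field names context and completion and the function names are assumptions made for illustration and may not match the repository's converters. BPE encoding is left out because it needs the vocabulary files.

```python
import json

def e2e_line_to_record(line):
    """Split one 'attributes<TAB>reference' line into a dict (hypothetical schema)."""
    attributes, reference = line.rstrip("\n").split("\t", 1)
    return {"context": attributes.strip(), "completion": reference.strip()}

def to_jsonl(lines):
    """Serialize non-empty lines as JSONL: one JSON object per line."""
    return "\n".join(json.dumps(e2e_line_to_record(l)) for l in lines if l.strip())

# Made-up sample line, not taken from the dataset.
sample = ["name : Blue Spice | food : Italian\tBlue Spice serves Italian food.\n"]
jsonl_text = to_jsonl(sample)
print(jsonl_text)
```

Keeping one JSON object per line lets later stages stream the file without loading a whole split into memory.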
Step 3: Model Configuration with LoRA
Initialize the GPT-2 model with LoRA adapters injected into the multi-head attention layers. The custom GPT-2 implementation uses loralib.MergedLinear for the combined attention QKV projection, adding low-rank updates to the query and value sub-projections only. The key projection receives no adapter, and all pretrained weights stay frozen.
Key considerations:
- LoRA is applied via MergedLinear with enable_lora=[True, False, True] for the c_attn projection (Q, K, V)
- The fan_in_fan_out flag is set to True because GPT-2's Conv1D layer stores its weight as (in_features, out_features), the transpose of the nn.Linear layout
- Key hyperparameters: lora_dim (rank r), lora_alpha (scaling factor), lora_dropout
- Paper defaults: r=4, alpha=32, dropout=0.1 for GPT-2 Medium on E2E
- The pretrained checkpoint is loaded with strict=False because it contains no LoRA weights; lora.mark_only_lora_as_trainable then freezes every non-LoRA parameter
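The effect of enable_lora=[True, False, True] can be sketched in plain Python. This is a conceptual illustration only, not how loralib implements MergedLinear internally: the function names, the toy shapes, and the list-of-lists matrices are all assumptions. Each enabled slot adds (alpha / r) * B(Ax) to its slice of the fused output, and the disabled key slot adds nothing.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merged_lora_delta(x, adapters, enable_lora, d_out, alpha, r):
    """Low-rank update for a fused projection.

    adapters holds one (A, B) pair per enabled slot, with A of shape
    (r, d_in) and B of shape (d_out, r). Disabled slots contribute zeros.
    """
    scale = alpha / r
    remaining = iter(adapters)
    delta = []
    for enabled in enable_lora:
        if enabled:
            A, B = next(remaining)
            delta.extend(scale * u for u in matvec(B, matvec(A, x)))
        else:
            delta.extend([0.0] * d_out)
    return delta

# Toy sizes: d_in = d_out = 2 and rank r = 1 (the setting above is r=4, alpha=32).
x = [1.0, 2.0]
A_q, B_q = [[0.5, 0.5]], [[1.0], [0.0]]
A_v, B_v = [[1.0, 0.0]], [[0.0], [2.0]]
delta = merged_lora_delta(x, [(A_q, B_q), (A_v, B_v)], [True, False, True],
                          d_out=2, alpha=32, r=1)
print(delta)  # the middle (key) slice stays at zero
```

In LoRA, B starts at zero, so the delta is zero at initialization and the adapted model begins with exactly the pretrained outputs.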
Step 4: Distributed Training
Fine-tune the model using the PyTorch distributed training framework with an AdamW optimizer and linear learning rate schedule. The training loop processes encoded input-target pairs with optional label smoothing and gradient accumulation.
Key considerations:
- Uses torch.distributed.launch for multi-GPU training
- AdamW optimizer with configurable learning rate (0.0002 for E2E), weight decay (0.01), and beta2 (0.999)
- Linear learning rate scheduler with warmup (500 steps for E2E)
- Checkpoints are saved at regular intervals (every 1000 steps)
- Training runs for 5 epochs with gradient accumulation support
- Loss is computed using cross-entropy with optional label smoothing (0.1)
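Two of the items above can be sketched in a few lines: a warmup-then-linear-decay learning rate and a label-smoothed cross-entropy for a single token. Both are common formulations written under assumptions; the repository's own scheduler and loss may differ in detail, and total_steps here is a made-up value that in practice depends on dataset size and batch size.

```python
import math

def linear_warmup_lr(step, max_lr, warmup_steps, total_steps):
    """Ramp linearly from 0 to max_lr, then decay linearly to 0 at total_steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return max_lr * remaining / max(1, total_steps - warmup_steps)

def label_smoothed_nll(logits, target, smoothing):
    """Cross-entropy for one token, mixed with the mean negative log-probability."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    log_probs = [v - log_z for v in logits]
    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - smoothing) * nll + smoothing * uniform

# Learning rate 0.0002 and 500 warmup steps match the E2E settings listed above.
for step in (0, 250, 500, 5000, 10000):
    print(step, linear_warmup_lr(step, 2e-4, 500, 10000))
print(label_smoothed_nll([2.0, 0.5, -1.0], target=0, smoothing=0.1))
```

With smoothing set to 0 the loss reduces to ordinary cross-entropy; a positive value penalizes over-confident predictions by pulling some weight toward the uniform distribution.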
Step 5: Beam Search Generation
Generate text outputs from the fine-tuned model using beam search decoding on the test set. The beam search implementation supports length penalty, repetition penalty, and n-gram blocking.
Key considerations:
- Beam width of 10 is used for evaluation
- A length penalty of 0.8 rescales beam scores by hypothesis length when ranking candidates, so that the ranking is not driven by raw log-probability alone
- No-repeat n-gram size (4) prevents repetitive phrases
- Generation terminates on the configured EOS token ID (628 in the reference scripts; note that GPT-2's <|endoftext|> token is ID 50256)
- Outputs are saved as JSONL files containing token IDs
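Two of the decoding constraints above can be sketched independently of the model. The helper names are hypothetical, and the length normalization shown (dividing the summed log-probability by length raised to the penalty) is one common convention that is assumed here rather than taken from the repository's beam search.

```python
def banned_next_tokens(tokens, n):
    """Return tokens that would complete an n-gram already present in tokens."""
    if n <= 0 or len(tokens) < n - 1:
        return set()
    prefix = tuple(tokens[len(tokens) - (n - 1):])  # last n-1 generated tokens
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned

def length_normalized_score(sum_log_prob, length, length_penalty):
    """Rescale a hypothesis score by its length (assumed convention)."""
    return sum_log_prob / (length ** length_penalty)

history = [1, 2, 3, 4, 9, 1, 2, 3]
blocked = banned_next_tokens(history, n=4)
print(blocked)  # token 4 would repeat the 4-gram (1, 2, 3, 4)
print(length_normalized_score(-4.0, 4, 0.8))
```

During beam search, the scores of banned tokens would be set to negative infinity before the top candidates are selected at each step.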
Step 6: Text Decoding
Convert beam search output token IDs back to human-readable text using the BPE decoder. Also formats reference texts from the original test set for evaluation comparison.
Key considerations:
- Uses the GPT-2 BPE vocabulary for detokenization
- Generates both prediction and reference text files
- For WebNLG and DART, multiple references per input are handled (up to 6)
- Optional tokenization and lowercasing for evaluation consistency
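Detokenization with a byte-level BPE vocabulary can be sketched as follows. The byte-to-character table follows the widely used GPT-2 scheme; the three-entry toy_vocab and its IDs are made up for illustration, whereas a real run would invert the mapping stored in encoder.json.

```python
def bytes_to_unicode():
    """Map each byte value to a printable unicode character (GPT-2 style)."""
    visible = (list(range(0x21, 0x7F))      # '!' .. '~'
               + list(range(0xA1, 0xAD))    # '¡' .. '¬'
               + list(range(0xAE, 0x100)))  # '®' .. 'ÿ'
    byte_values, code_points = visible[:], visible[:]
    extra = 0
    for b in range(256):
        if b not in visible:
            # Non-printable bytes are shifted to code points from 256 upward.
            byte_values.append(b)
            code_points.append(256 + extra)
            extra += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}

def decode(token_ids, id_to_token):
    """Join token strings, then map their characters back to raw UTF-8 bytes."""
    char_to_byte = {c: b for b, c in bytes_to_unicode().items()}
    text = "".join(id_to_token[i] for i in token_ids)
    return bytes(char_to_byte[c] for c in text).decode("utf-8", errors="replace")

# Hypothetical vocabulary; "\u0120" is the character that stands in for a space.
toy_vocab = {0: "Hello", 1: "\u0120world", 2: "!"}
decoded = decode([0, 1, 2], toy_vocab)
print(decoded)
```

Because spaces are encoded inside the tokens themselves, decoding needs no separate rule for inserting whitespace between tokens.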
Step 7: Evaluation
Compute standard NLG metrics by comparing generated text against reference texts. Different evaluation scripts are used depending on the dataset.
Key considerations:
- E2E evaluation uses measure_scores.py computing BLEU, NIST, METEOR, ROUGE-L, CIDEr
- WebNLG and DART evaluation uses GenerationEval/eval.py computing BLEU, METEOR, TER
- Multiple references are supported for more robust evaluation
- The LoRA paper reports results with confidence intervals over multiple random seeds, so a single run gives only a point estimate
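To show how multiple references enter an n-gram metric, the sketch below computes a clipped n-gram precision, the core quantity inside BLEU. It is not a full BLEU implementation (no brevity penalty, no averaging over n-gram orders) and is not a substitute for measure_scores.py or GenerationEval/eval.py; the function names and whitespace tokenization are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams matched, clipped by the best reference count."""
    cand = ngrams(candidate.split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

refs = ["the cat is on the mat", "there is a cat on the mat"]
score = clipped_precision("the the the the the the the", refs, n=1)
print(score)
```

Clipping by the maximum count across references keeps a degenerate output that repeats one common word from scoring well, while still letting any single reference justify a match.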