Implementation: Microsoft LoRA PyTorch Training Loop
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization |
| Pattern Doc | Yes |
| Last Updated | 2026-02-10 05:00 GMT |
Overview
Pattern documentation for the standard PyTorch training loop used with LoRA-augmented models.
Description
This pattern doc describes the interface users must follow when training a LoRA model. Because LoRA requires no custom training logic, the pattern is simply the standard PyTorch training loop with the optimizer constructed to receive only the trainable (LoRA) parameters. The Microsoft LoRA repository includes a reference implementation in its GPT-2 fine-tuning example.
Usage
Follow this pattern after completing model preparation (LoRA layer replacement via loralib layers and parameter freezing via mark_only_lora_as_trainable). Compatible with any PyTorch training framework (raw loops, PyTorch Lightning, HuggingFace Trainer, etc.).
Code Reference
Reference Implementation
- Repository: microsoft/LoRA
- File: examples/NLG/src/gpt2_ft.py
- Lines: 171-258
Pattern Interface
Prerequisites
- Model must have LoRA layers replacing target layers (via loralib.Linear, loralib.MergedLinear, etc.)
- Non-LoRA parameters must be frozen (via loralib.mark_only_lora_as_trainable)
- Optimizer must be constructed with parameter filtering
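Before entering the loop, it is worth confirming that the freeze actually took. A minimal sketch of that check, using a hypothetical toy module that mimics loralib's frozen-base-plus-`lora_A`/`lora_B` layout (not loralib itself):

```python
import torch
import torch.nn as nn

# Toy stand-in for a LoRA-augmented layer: a frozen base weight plus
# trainable low-rank factors lora_A / lora_B (loralib's naming convention).
class ToyLoRALinear(nn.Module):
    def __init__(self, in_f=8, out_f=8, r=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(r, in_f))
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))

    def forward(self, x):
        # Effective weight is the frozen base plus the low-rank update.
        return x @ (self.weight + self.lora_B @ self.lora_A).T

model = ToyLoRALinear()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the LoRA factors survive the freeze
```

With a real loralib model, the same listing after `mark_only_lora_as_trainable` should show only `lora_`-prefixed parameters (plus biases, depending on the `bias=` argument).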
Core Pattern
```python
import torch
import loralib as lora

# ===== Model Preparation (prerequisites) =====
model = create_model_with_lora_layers()  # placeholder: build model with loralib layers
lora.mark_only_lora_as_trainable(model, bias='none')

# ===== Optimizer Setup =====
# IMPORTANT: filter to only include trainable (LoRA) parameters
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=2e-4,
    weight_decay=0.01
)

# ===== Standard Training Loop =====
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass
        inputs, labels = batch
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)

        # Backward pass
        loss.backward()

        # Optional: gradient clipping on the trainable parameters only
        torch.nn.utils.clip_grad_norm_(
            filter(lambda p: p.requires_grad, model.parameters()),
            max_norm=1.0
        )

        # Optimizer step
        optimizer.step()
        optimizer.zero_grad()
```
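The `filter(lambda p: p.requires_grad, ...)` idiom is the load-bearing piece of this pattern: only parameters that survive the filter ever reach the optimizer. A small self-contained check, using a plain `nn.Linear` with a manually frozen weight as a stand-in for a LoRA-prepared model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
model.weight.requires_grad = False  # stand-in for a frozen base weight (16 params)

# Only the still-trainable bias (4 params) passes the filter.
opt = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=2e-4
)

n_opt = sum(p.numel() for g in opt.param_groups for p in g['params'])
n_total = sum(p.numel() for p in model.parameters())
print(n_opt, n_total)  # 4 of 20 parameters reach the optimizer
```

Passing `model.parameters()` unfiltered would also work (frozen parameters receive no gradients), but filtering avoids allocating optimizer state for parameters that never update, which is most of the memory saving LoRA is used for.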
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | nn.Module | Yes | Model with LoRA layers and frozen base parameters |
| dataloader | DataLoader | Yes | Standard PyTorch DataLoader providing training batches |
| learning_rate | float | Yes | Learning rate for the optimizer (typically 1e-4 to 5e-4 for LoRA) |
| num_epochs | int | Yes | Number of training epochs |
Outputs
| Name | Type | Description |
|---|---|---|
| model | nn.Module | Model with updated LoRA parameters (base weights unchanged) |
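The output contract (LoRA parameters updated, base weights unchanged) follows directly from the optimizer construction and can be verified in a few lines. A sketch with a frozen `nn.Linear` weight standing in for the base model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
model.weight.requires_grad = False  # frozen "base" weight

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-2
)

before = model.weight.clone()

# One training step on random data: only the trainable bias updates.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

unchanged = torch.equal(model.weight, before)
print(unchanged)  # frozen base weight is untouched by the step
```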
Usage Examples
Minimal Training Loop
```python
import torch
import loralib as lora

# Assume model is already prepared with LoRA layers and frozen params
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=2e-4
)

model.train()
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
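After training, only the LoRA parameters need to be persisted. loralib provides `lora.lora_state_dict(model)` for this; an equivalent sketch that filters the state dict by the `lora_` naming convention, demonstrated on a hypothetical toy module (not loralib itself):

```python
import torch
import torch.nn as nn

# Keep only entries whose key contains 'lora_' (loralib's naming convention).
def lora_only_state_dict(model: nn.Module) -> dict:
    return {k: v for k, v in model.state_dict().items() if 'lora_' in k}

# Hypothetical LoRA-style module: frozen base weight plus low-rank factors.
class ToyLoRALinear(nn.Module):
    def __init__(self, in_f=8, out_f=8, r=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(r, in_f))
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))

model = ToyLoRALinear()
ckpt = lora_only_state_dict(model)
print(sorted(ckpt))  # base weight excluded; save with torch.save(ckpt, path)
```

The resulting checkpoint is a small fraction of the full model, since the frozen base weights are reconstructible from the original pretrained checkpoint.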
With Learning Rate Scheduler
```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
import loralib as lora

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=5e-4,
    weight_decay=0.01
)
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)

model.train()
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        filter(lambda p: p.requires_grad, model.parameters()),
        max_norm=1.0
    )
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
Reference: GPT-2 Fine-Tuning (from repository)
The file examples/NLG/src/gpt2_ft.py (lines 171-258) demonstrates the full training loop pattern used in the official LoRA repository for fine-tuning GPT-2:
```python
# Simplified from examples/NLG/src/gpt2_ft.py
optimizer = create_adam_optimizer_from_args(model, args)

for epoch in range(args.n_epochs):
    model.train()
    for batch in train_loader:
        # Forward
        output = model(batch)
        loss = output.loss

        # Backward
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

        # Update
        optimizer.step()
        optimizer.zero_grad()

    # Evaluation at the end of each epoch
    model.eval()
    evaluate(model, eval_loader)
```