Implementation: Microsoft LoRA PyTorch Training Loop
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization |
| Pattern Doc | Yes |
| Last Updated | 2026-02-10 05:00 GMT |
Overview
Pattern documentation for the standard PyTorch training loop used with LoRA-augmented models.
Description
This pattern doc describes the interface users must follow when training a LoRA model. Because LoRA requires no custom training logic, the pattern is simply the standard PyTorch training loop with the optimizer constructed to receive only the trainable (LoRA) parameters. The Microsoft LoRA repository includes a reference implementation in its GPT-2 fine-tuning example.
Usage
Follow this pattern after completing model preparation (LoRA layer replacement via loralib layers and parameter freezing via mark_only_lora_as_trainable). Compatible with any PyTorch training framework (raw loops, PyTorch Lightning, HuggingFace Trainer, etc.).
Code Reference
Reference Implementation
- Repository: microsoft/LoRA
- File: examples/NLG/src/gpt2_ft.py
- Lines: 171-258
Pattern Interface
Prerequisites
- Model must have LoRA layers replacing target layers (via loralib.Linear, loralib.MergedLinear, etc.)
- Non-LoRA parameters must be frozen (via loralib.mark_only_lora_as_trainable)
- Optimizer must be constructed with parameter filtering
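Before entering the loop, it is worth confirming that the freeze actually took. A minimal sketch of that check, using a hypothetical toy module that mimics loralib's frozen-base-plus-`lora_A`/`lora_B` layout (not loralib itself):

```python
import torch
import torch.nn as nn

# Toy stand-in for a LoRA-augmented layer: a frozen base weight plus
# trainable low-rank factors lora_A / lora_B (loralib's naming convention).
class ToyLoRALinear(nn.Module):
    def __init__(self, in_f=8, out_f=8, r=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(r, in_f))
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))

    def forward(self, x):
        # Effective weight is the frozen base plus the low-rank update.
        return x @ (self.weight + self.lora_B @ self.lora_A).T

model = ToyLoRALinear()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only the LoRA factors survive the freeze
```

With a real loralib model, the same listing after `mark_only_lora_as_trainable` should show only `lora_`-prefixed parameters (plus biases, depending on the `bias=` argument).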
Core Pattern
```python
import torch
import loralib as lora

# ===== Model Preparation (prerequisites) =====
model = create_model_with_lora_layers()  # placeholder: build model with loralib layers
lora.mark_only_lora_as_trainable(model, bias='none')

# ===== Optimizer Setup =====
# IMPORTANT: filter to only include trainable (LoRA) parameters
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=2e-4,
    weight_decay=0.01
)

# ===== Standard Training Loop =====
model.train()
for epoch in range(num_epochs):
    for batch in dataloader:
        # Forward pass
        inputs, labels = batch
        outputs = model(inputs)
        loss = loss_fn(outputs, labels)

        # Backward pass
        loss.backward()

        # Optional: gradient clipping on the trainable parameters only
        torch.nn.utils.clip_grad_norm_(
            filter(lambda p: p.requires_grad, model.parameters()),
            max_norm=1.0
        )

        # Optimizer step
        optimizer.step()
        optimizer.zero_grad()
```
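The `filter(lambda p: p.requires_grad, ...)` idiom is the load-bearing piece of this pattern: only parameters that survive the filter ever reach the optimizer. A small self-contained check, using a plain `nn.Linear` with a manually frozen weight as a stand-in for a LoRA-prepared model:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 4)
model.weight.requires_grad = False  # stand-in for a frozen base weight (16 params)

# Only the still-trainable bias (4 params) passes the filter.
opt = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=2e-4
)

n_opt = sum(p.numel() for g in opt.param_groups for p in g['params'])
n_total = sum(p.numel() for p in model.parameters())
print(n_opt, n_total)  # 4 of 20 parameters reach the optimizer
```

Passing `model.parameters()` unfiltered would also work (frozen parameters receive no gradients), but filtering avoids allocating optimizer state for parameters that never update, which is most of the memory saving LoRA is used for.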
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | nn.Module | Yes | Model with LoRA layers and frozen base parameters |
| dataloader | DataLoader | Yes | Standard PyTorch DataLoader providing training batches |
| learning_rate | float | Yes | Learning rate for the optimizer (typically 1e-4 to 5e-4 for LoRA) |
| num_epochs | int | Yes | Number of training epochs |
Outputs
| Name | Type | Description |
|---|---|---|
| model | nn.Module | Model with updated LoRA parameters (base weights unchanged) |
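The output contract (LoRA parameters updated, base weights unchanged) follows directly from the optimizer construction and can be verified in a few lines. A sketch with a frozen `nn.Linear` weight standing in for the base model:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
model.weight.requires_grad = False  # frozen "base" weight

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-2
)

before = model.weight.clone()

# One training step on random data: only the trainable bias updates.
loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()

unchanged = torch.equal(model.weight, before)
print(unchanged)  # frozen base weight is untouched by the step
```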
Usage Examples
Minimal Training Loop
```python
import torch
import loralib as lora

# Assume model is already prepared with LoRA layers and frozen params
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=2e-4
)

model.train()
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```
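After training, only the LoRA parameters need to be persisted. loralib provides `lora.lora_state_dict(model)` for this; an equivalent sketch that filters the state dict by the `lora_` naming convention, demonstrated on a hypothetical toy module (not loralib itself):

```python
import torch
import torch.nn as nn

# Keep only entries whose key contains 'lora_' (loralib's naming convention).
def lora_only_state_dict(model: nn.Module) -> dict:
    return {k: v for k, v in model.state_dict().items() if 'lora_' in k}

# Hypothetical LoRA-style module: frozen base weight plus low-rank factors.
class ToyLoRALinear(nn.Module):
    def __init__(self, in_f=8, out_f=8, r=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f), requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(r, in_f))
        self.lora_B = nn.Parameter(torch.zeros(out_f, r))

model = ToyLoRALinear()
ckpt = lora_only_state_dict(model)
print(sorted(ckpt))  # base weight excluded; save with torch.save(ckpt, path)
```

The resulting checkpoint is a small fraction of the full model, since the frozen base weights are reconstructible from the original pretrained checkpoint.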
With Learning Rate Scheduler
```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
import loralib as lora

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=5e-4,
    weight_decay=0.01
)
scheduler = CosineAnnealingLR(optimizer, T_max=num_steps)

model.train()
for batch in dataloader:
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        filter(lambda p: p.requires_grad, model.parameters()),
        max_norm=1.0
    )
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```
Reference: GPT-2 Fine-Tuning (from repository)
The file examples/NLG/src/gpt2_ft.py (lines 171-258) demonstrates the full training loop pattern used in the official LoRA repository for fine-tuning GPT-2:
```python
# Simplified from examples/NLG/src/gpt2_ft.py
optimizer = create_adam_optimizer_from_args(model, args)

for epoch in range(args.n_epochs):
    model.train()
    for batch in train_loader:
        # Forward
        output = model(batch)
        loss = output.loss

        # Backward
        loss.backward()

        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), args.max_grad_norm)

        # Update
        optimizer.step()
        optimizer.zero_grad()

    # Evaluation at the end of each epoch
    model.eval()
    evaluate(model, eval_loader)
```